• Open

    How Metadata Improves Security, Quality, and Transparency
    How does Spotify battle against a giant like Apple? One word: data. With machine learning and AI, Spotify creates value for its users by providing a more personalized and bespoke experience. Let’s take a quick look at the layers of aggregate information that are used to enhance their platform: Spotify uses natural language processing (NLP)… Read More »How Metadata Improves Security, Quality, and Transparency The post How Metadata Improves Security, Quality, and Transparency appeared first on Data Science Central.  ( 5 min )
    Business Analytics from Application Logs and SQL Server Database using Splunk
    Introduction/ Problem  Splunk is a well-known log management tool. Splunk mines log from different machines in real-time and can be used to monitor, search, and analyze gathered data. It is a Big Data log management tool that can give insight from the unstructured data stored in the Splunk indexes. Splunk analytics helps turn unstructured log… Read More »Business Analytics from Application Logs and SQL Server Database using Splunk  The post Business Analytics from Application Logs and SQL Server Database using Splunk  appeared first on Data Science Central.  ( 11 min )
    Tips for Weaving and Implementing a Successful Data Mesh
    For years, organizations have been working steadily to consolidate their data into a single place via a data warehouse, or more recently, a data lake. Data lakes offer key advantages over data warehouses, data marts, and traditional databases that require data to be structured and organized in particular ways. However, businesses have found that they… Read More »Tips for Weaving and Implementing a Successful Data Mesh The post Tips for Weaving and Implementing a Successful Data Mesh appeared first on Data Science Central.  ( 5 min )
    How to Modernize Enterprise Data and Analytics Platform (Part 2 of 4)
    In many cases, for an enterprise to build its digital business technology platform, it must modernize its traditional data and analytics architecture. A modern data and analytics platform should be built on services-based principles and architecture. Introduction part 1, provided a conceptual-level reference architecture of a traditional Data and Analytics (D&A) platform. This part, provides… Read More »How to Modernize Enterprise Data and Analytics Platform (Part 2 of 4) The post How to Modernize Enterprise Data and Analytics Platform (Part 2 of 4) appeared first on Data Science Central.  ( 16 min )
    AI SEO: How AI Helps You Optimize Content for Search Results
    AI has been making its way into the marketing world over the last few years. Businesses have touted it as the solution to their problems and a way to incorporate technology into their processes. But, how is AI changing SEO? How can you use AI to improve your business? Machine learning and artificial intelligence are… Read More »AI SEO: How AI Helps You Optimize Content for Search Results The post AI SEO: How AI Helps You Optimize Content for Search Results appeared first on Data Science Central.  ( 4 min )
    Understanding the Role of Augmented Data Catalogs in Data Governance
    Every time we think we have grasped a new technology and its use. Sometimes that shift is an increase in the technology itself that seemingly intensifies the original version. Sometimes something happens that causes a significant transformation in the technology’s nature. As the technology’s significance is increasingly understood, the name is altered to better reflect… Read More »Understanding the Role of Augmented Data Catalogs in Data Governance The post Understanding the Role of Augmented Data Catalogs in Data Governance appeared first on Data Science Central.  ( 4 min )
    Security Issues in The Metaverse
    Issues that will affect the metaverse range from hacking to cyberbullying. Proposed solutions are far from perfect. We may see end-to-end encryption disappear soon. The only real solution might be not to enter the metaverse at all. The advent of the metaverse compounds a list of existing security concerns. The addition of another dimension to… Read More »Security Issues in The Metaverse The post Security Issues in The Metaverse appeared first on Data Science Central.  ( 4 min )
  • Open

    Easily migrate your IVR flows to Amazon Lex using the IVR migration tool
    This post was co-written by John Heater, SVP of the Contact Center Practice at NeuraFlash. NeuraFlash is an Advanced AWS Partner with over 40 collective years of experience in the voice and automation space. With a dedicated team of conversation designers, data engineers, and AWS developers, NeuraFlash helps customers take advantage of the power of Amazon […]  ( 9 min )
    How Amazon Search achieves low-latency, high-throughput T5 inference with NVIDIA Triton on AWS
    Amazon Search’s vision is to enable customers to search effortlessly. Our spelling correction helps you find what you want even if you don’t know the exact spelling of the intended words. In the past, we used classical machine learning (ML) algorithms with manual feature engineering for spelling correction. To make the next generational leap in […]  ( 7 min )
  • Open

    [D] Relative error metrics comparing Model complexity versus error
    Is anyone aware of metrics or measures that classify the error relative the complexity of the model. If so could anyone provide some references where people have introduced such a notion? submitted by /u/AbjectDrink3276 [link] [comments]
    [R] Google Research: Self-Consistency Improves Chain of Thought Reasoning in Language Models
    submitted by /u/ReasonablyBadass [link] [comments]
    [P][R] Hear what you see: Autoencoder to convert video to audio
    GitHub: https://github.com/muxamilian/wysiwyh WYSIWHY is a neural network that transforms an input video into an audio sequence in real time. It does so by compressing each image using an autoencoder and interpreting the resulting code as a frequency range. This could potentially be useful for the visually impaired, helping with indoor navigation. It could also be used to transform an infrared or ultraviolet video to sound and thus enable one to perceive otherwise invisible colors. Demo; recommended with sound turned on In a usual autoencoder, the elements of the code vector are independent of each other. This makes subtle differences between adjacent elements of the code vector hard to perceive for humans. The resulting audio sequence would sound like indistinguishable white noise. For WYSIWYH thus an encoder is used which hierarchically structures the code, meaning that large differences in the input image result in large differences in the output code. This also results in adjacent elements of the code vector being correlated. This makes the autoencoder’s code more friendly to human perception. Overall approach submitted by /u/muxamilian [link] [comments]  ( 2 min )
    [R] Improving anatomical plausibility in medical image segmentation via hybrid graph neural networks: applications to chest x-ray analysis
    https://arxiv.org/abs/2203.10977 submitted by /u/gaggi_94 [link] [comments]
    [D] STUMPY v1.11.0 Released for Modern Time Series Analysis
    We're happy to announce the release of STUMPY v1.11.0! This version includes the oft requested Minkowski (p-norm) Distance, support for Multi-dimensional Motif Discovery, new Annotation vector tutorials, and enhancements for Pan Matrix Profiles! https://github.com/TDAmeritrade/stumpy submitted by /u/slaw07 [link] [comments]  ( 1 min )
    [N] Call for submissions: Foundations of Digital Games 2022, 5-8 September (Athens, GR & hybrid)
    FDG2022: Foundations of Digital Games 2022Athens, Greece, September 5-8, 2022Conference website: http://fdg2022.org/ Foundations of Digital Games (FDG) 2022 invites research contributions in the form of papers, posters and demos, doctoral consortium applications, as well as panel, competition, and workshop proposals. We invite contributions from within and across any discipline committed to advancing knowledge on the foundations of games: computer science and engineering, humanities and social sciences, arts and design, mathematics and natural sciences. As was the case in the previous years, we aim to publish the FDG 2022 proceedings in the ACM Digital Library. ​FDG invites authors to submit short or full papers reporting new research. Both short and full papers need to be anonymized and…  ( 1 min )
    [R] JAX meets Flower - Federated Learning with JAX
    Flower and its community are growing. Since Flower is a friendly federated learning framework, the goal is always to get an easy start to federated learning for every data scientist. This involves having Flower coding examples for different machine learning frameworks. One of the frameworks is JAX which was developed by Google researchers to run NumPy programs on GPUs and TPUs. It is quickly rising in popularity and is used by DeepMind to support and accelerate its research. We couldn’t miss the opportunity to create a code example and a blog post about “JAX meets Flower - Federated Learning with JAX”. It takes always some time to get into a new machine learning framework and its syntax but it is easy to combine it with Flower. You can check out the blog post here: https://flower.dev/blog/2022-03-22-running-jax-federated-jax-meets-flower/ If you are interested in writing a blog post for our Flower community, contact me. submitted by /u/burnai [link] [comments]  ( 1 min )
    [D] With experience deploying batch models, how to learn about streaming?
    Hi! I have some experience deploying batch machine learning models and now I want to learn about real-time models. More specifically, how to put them in production and what are the best practices and tools for different use-cases. Any ideas? I was thinking of reading the book "Designing Event-Driven Systems" by Ben Stopford (I think it's based on Kafka which seems quite popular), but would like to hear your thoughts or if someone has any other reference. Thanks and I hope this is the right sub! submitted by /u/Silver_Book_938 [link] [comments]  ( 1 min )
    [P] A Social Platform to Rate and Review Machine Learning Research Papers
    We developed a social platform for everyone to rate and review machine learning research papers. One can think of it as the IMDB database for the research community. Link: https://doublind.com Features we think that are important: Search all machine learning related papers, old and new. Rate and review the paper you've just read, open or anonymous. A profile page is created for all users to display their paper reviews. Each paper has a rating score that help users find good papers to read. Users can like, comment and share a paper review, helping it reaching more readers. ​ We'd love to hear your feedback and suggestions. Thank you all and we appreciate the support. submitted by /u/DouBlindDotCOM [link] [comments]  ( 4 min )
    [D] always behind Sota
    How do you guys deal with the stress of developing a new models and being off from SOTA but also needing to publish papers? Do people publish results that aren’t quite SOTA? Do you tweak results? Do you do something else? submitted by /u/AbjectDrink3276 [link] [comments]  ( 3 min )
    [R] Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations
    submitted by /u/hardmaru [link] [comments]
    [D] Will a PhD in Biomedical Engineering limit opportunities if I want to become a research scientist in ML?
    I’m a tech in an academic computational lab at a very large US flagship R1 university. My current lab is a neuroscience lab that does lots of ML theory stuff (bio inspired ML). My primary goal is to get a PhD and find a research scientist position in industry at a non-FAANG company. Because my PI is a neuroscientist, I might not be able to work with him as a PhD student if I apply to the EE or CS program at this university, but he has an affiliate appointment at the Biomedical Engineering department. Would having a biomedical engineering degree in any way affect my ability to get a research scientist ML job? As long as I have a productive PhD with ML publications etc., and get internships at relevant places, I should be fine right? Will I have a worse chance at non-biotech companies? If BME is an issue, I’m sure my PI can get affiliate status in EE, because he has multiple collaborators in EE. submitted by /u/throawaythroaway11 [link] [comments]  ( 5 min )
  • Open

    Dopamine Helps Signal to Neurons When to Start a Movement
    submitted by /u/gwern [link] [comments]
    LSTM encoder in the policy?
    Hi all, I'm trying to code an actor net that consists of an LSTM followed by two fully connected layers. The policy is therefore parameterized using an LSTM. The LSTM is meant to enable agents to remember the history of states in which they have observed other agents, and improve their ability to model other agents’ policies. Can someone refer me to PyTorch code for such an architecture? Thanks! submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    How Hugging Face 🤗 can contribute to the Deep Reinforcement Learning Ecosystem?
    Hey there! 👋 I'm Thomas Simonini from Hugging Face 🤗. I work on building tools, environments and integrating RL libraries to empower researchers and RL enthusiasts. I was wondering how Hugging Face can be useful to you in the Deep Reinforcement Learning Ecosystem? What do you need as RL researcher/enthusiast/engineer and how we can help you? For now: We integrated Stable-baselines3 to the Hub** such that you can: Easily host and test your saved models. Load powerful, trained models from the community https://preview.redd.it/n0b2s1gndyo81.jpg?width=1920&format=pjpg&auto=webp&s=eb62a1f4323b12a5c1eb9d7bcb44ebbc6bae579f We're currently integrating more libraries (RL-Zoo, CleanRL...) We're working on building tools that allow you to generate a replay video of your agent and test it. We're building open-source RL environments such as snowball-fight And finally, we're working on state of the art's research with Decision Transformers, Embodied environments, etc. But I would love to know what do you need as RL researcher/enthusiast/engineer and how we can help you? Thanks for your feedback, 📢 To keep in touch is to join our discord server to exchange with us and with the community. submitted by /u/cranthir_ [link] [comments]  ( 3 min )
    how to set the distance_threshold value and make it effective
    I use the [RL Robotics library](https://github.com/Farama-Foundation/Gym-Robotics) to simulate robots. In the Fetchxxx-Env, the variable distance_threshold is set as 0.05 default to determine whether it completed successfully a task. I try to change it by using env.distance_threshold = 0.01(or other value) Because I think 0.05(m) is too rough for some tasks. But it doesn't work! (When the success_rate is 1.0, the distance between the achieved_goal and the desired_goal is 0.04x, which is larger than I set.) What should I do to change the conditions for determining whether a task is complete? Can anyone give me some advice or information? submitted by /u/Due_Advertising6542 [link] [comments]  ( 1 min )
    Need help trying to implement a paper
    Hello, I'm an undergraduate CS student trying to re-implement this paper for my University project. I have some experience in Machine Learning (Supervised Learning) and implemented some basic projects myself, but I'm a little bit confused about where to begin, I've found two different implementations online but I don't think that's gonna help me a lot as I'm not that experienced in this field. I would appreciate any suggestions on how I can understand the paper and implement it myself (not just a copy/paste from someone else's code) or the libraries that can help me, etc. Sorry for my bad English, I'm not a native speaker thanks a lot :) submitted by /u/ahmadreza_hadi [link] [comments]  ( 1 min )
    Robotics control : DRL vs classical approaches
    I'm curious about your opinion on this. There are impressive recent advances in DRL for humanoid whole-body control (Deepmimic, Drecon, etc...). Do you think these methods will one day overtake the "classical" approaches from control theory? Or we will see more and more methods combining these two approaches? submitted by /u/thejackforest [link] [comments]  ( 2 min )
  • Open

    "We've got incredible potential here to use the Sophiaverse virtual world, not just as a fun way for people to play or make money by selling NFTs they create, but also as a way to get human values and culture and the pragmatics of human interaction across to AI."- Ben Goertzel, CEO at SingularityNET
    submitted by /u/thedyezwfl [link] [comments]  ( 1 min )
    Google Research Proposes MaskGIT: A New Deep Learning Technique Based on Bi-Directional Generative Transformers For High-Quality and Fast Image Synthesis
    Generative Adversarial Networks (GANs), with their capacity of producing high-quality images, have been the leading technology in image generation for the past couple of years. Nevertheless, their minimax learning mechanism brought out different limits, such as training instability and mode collapse (i.e., when all the produced samples belong to a small set of samples). Recently, Generative Transformer models are beginning to match, or even surpass, the performances of GANs. The simple idea is to learn a function to encode the input image into a quantized sequence and then train an autoregressive Transformer on a sequence prediction task (i.e., predict an image token, given all the previous image tokens). This learning is based on maximum likelihood and thus not affected by the same issue…  ( 2 min )
    I decided to make roads out of the Lorenz attractor…
    submitted by /u/TheRealMandelbrotSet [link] [comments]
    Nvidia unveils latest chip to speed up AI computing
    submitted by /u/nonaime7777777 [link] [comments]
    Software Company Releases Chrome Extension That Detects AI-Generated Profile Pictures
    submitted by /u/nonaime7777777 [link] [comments]
    AI Combines With Decentralized Autonomous Organization
    submitted by /u/getrich_or_diemining [link] [comments]
    Alphabet is Spinning out Division X 'Sandbox AQ'
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    This Is The Reason Why Sleep Robots Are Becoming The Perfect Bed Mates
    submitted by /u/sopadebombillas [link] [comments]
    Unraveling the Mystery Behind Background Filters in Video Calling Apps
    submitted by /u/VikasOjha666 [link] [comments]
    Artificial Intelligence in the News (the period of March 21st, 2022)
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Best Machine Learning Books to read in 2022 -
    submitted by /u/sivasiriyapureddy [link] [comments]
    Could it be, that people spend a lot of money on AIs, which are generating realistic images and then sell the copyright of these images?
    submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 1 min )
    Nebullvm, an open-source library to accelerate AI inference by 5-20x in a few lines of code
    How does nebullvm work? It takes your AI model as input and outputs an optimized version that runs 5-20 times faster on your hardware. In other words, nebullvm tests multiple deep learning compilers to identify the best possible way to execute your model on your specific machine, without impacting the accuracy of your model. And that's it. In just a few lines of code. And a big thank you to everyone for supporting this open-source project! The library received 250+ Github stars⭐ on release day, and that's just amazing 🚀 ORIENTATION MAP Let's learn more about nebullvm and AI optimization. Where should we start? From... Some CONTEXT on why few developers optimize AI and related negative consequences An overview of how the LIBRARY works Some USE CASES, technology demonstrations and…  ( 6 min )
    Are there any ai's to try and recreate the voice you hear in your head?
    Im wondering if there is any ai that you can feed your voice to and it will try to make an approximation of what you sound like to yourself in your head? submitted by /u/alidan [link] [comments]  ( 1 min )
    Real-Time (Language) Model
    submitted by /u/BeginningInfluence55 [link] [comments]  ( 2 min )
    Looking for a Speaker for a Webinar about Artificial Intelligence
    Hi guys!, we are University Students and we're currently looking for a speaker that can discuss Artificial Intelligence on an introductory level and the Opportunities it can bring for Engineering Students. Thank you! Please DM asap me for more details. submitted by /u/BigBadToilet [link] [comments]
    Weekly China AI Newsletter: Cambricon Stocks Plunge After Sudden Exit of Huawei Veteran CTO; US-Banned DeepGlint Goes Public; China-US AI Collaboration Rises Fivefold Since 2010
    submitted by /u/trcytony [link] [comments]  ( 1 min )
  • Open

    NVIDIA Omniverse Upgrade Delivers Extraordinary Benefits to 3D Content Creators
    At GTC, NVIDIA announced significant updates for millions of creators using the NVIDIA Omniverse real-time 3D design collaboration platform. The announcements kicked off with updates to the Omniverse apps Create, Machinima and Showroom, with an immement View release. Powered by GeForce RTX and NVIDIA RTX GPUs, they dramatically accelerate 3D creative workflows. New Omniverse Connections Read article > The post NVIDIA Omniverse Upgrade Delivers Extraordinary Benefits to 3D Content Creators appeared first on NVIDIA Blog.  ( 5 min )
    At GTC: NVIDIA RTX Professional Laptop GPUs Debut, New NVIDIA Studio Laptops, a Massive Omniverse Upgrade and NVIDIA Canvas Update
    Digital artists and creative professionals have plenty to be excited about at NVIDIA GTC. Impressive NVIDIA Studio laptop offerings from ASUS and MSI launch with upgraded RTX GPUs, providing more options for professional content creators to elevate and expand creative possibilities. NVIDIA Omniverse gets a significant upgrade — including updates to the Omniverse Create, Machinima Read article > The post At GTC: NVIDIA RTX Professional Laptop GPUs Debut, New NVIDIA Studio Laptops, a Massive Omniverse Upgrade and NVIDIA Canvas Update appeared first on NVIDIA Blog.  ( 6 min )
    Keynote Wrap Up: Turning Data Centers into ‘AI Factories,’ NVIDIA CEO Intros Hopper Architecture, H100 GPU, New Supercomputers, Software
    Promising to transform trillion-dollar industries and address the “grand challenges” of our time, NVIDIA founder and CEO Jensen Huang Tuesday shared a vision of an era where intelligence is created on an industrial scale and woven into real and virtual worlds. Kicking off NVIDIA’s GTC conference, Huang introduced new silicon — including the new Hopper Read article > The post Keynote Wrap Up: Turning Data Centers into ‘AI Factories,’ NVIDIA CEO Intros Hopper Architecture, H100 GPU, New Supercomputers, Software appeared first on NVIDIA Blog.  ( 8 min )
    Unlimited Data, Unlimited Possibilities: UF Health and NVIDIA Build World’s Largest Clinical Language Generator
    The University of Florida’s academic health center, UF Health, has teamed up with NVIDIA to develop a neural network that generates synthetic clinical data — a powerful resource that researchers can use to train other AI models in healthcare. Trained on a decade of data representing more than 2 million patients, SynGatorTron is a language Read article > The post Unlimited Data, Unlimited Possibilities: UF Health and NVIDIA Build World’s Largest Clinical Language Generator appeared first on NVIDIA Blog.  ( 4 min )
    New NVIDIA RTX GPUs Tackle Demanding Professional Workflows and Hybrid Work, Enabling Creation From Anywhere
    Remote work and hybrid workplaces are the new normal for professionals in many industries. Teams spread throughout the world are expected to create and collaborate while maintaining top productivity and performance. Businesses use the NVIDIA RTX platform to enable their workers to keep up with the most demanding workloads, from anywhere. And today, NVIDIA is Read article > The post New NVIDIA RTX GPUs Tackle Demanding Professional Workflows and Hybrid Work, Enabling Creation From Anywhere appeared first on NVIDIA Blog.  ( 4 min )
    First Wave of Startups Harnesses UK’s Most Powerful Supercomputer to Power Digital Biology Breakthroughs
    Four NVIDIA Inception members have been selected as the first cohort of startups to access Cambridge-1, the U.K.’s most powerful supercomputer. The system will help British companies Alchemab Therapeutics, InstaDeep, Peptone and Relation Therapeutics enable breakthroughs in digital biology. Officially launched in July, Cambridge-1 — an NVIDIA DGX SuperPOD cluster powered by NVIDIA DGX A100 Read article > The post First Wave of Startups Harnesses UK’s Most Powerful Supercomputer to Power Digital Biology Breakthroughs appeared first on NVIDIA Blog.  ( 4 min )
    NVIDIA Omniverse Ecosystem Expands 10x, Amid New Features and Services for Developers, Enterprises and Creators
    When it comes to creating and connecting virtual worlds, over 150,000 individuals have downloaded NVIDIA Omniverse to make huge leaps in transforming 3D design workflows and achieve new heights of real-time, physically accurate simulations. At GTC, NVIDIA today announced new releases and updates for Omniverse — including the latest Omniverse Connectors and libraries — expanding Read article > The post NVIDIA Omniverse Ecosystem Expands 10x, Amid New Features and Services for Developers, Enterprises and Creators appeared first on NVIDIA Blog.  ( 4 min )
    NVIDIA Unveils Isaac Nova Orin to Accelerate Development of Autonomous Mobile Robots
    Next time socks, cereal or sandpaper shows up in hours delivered to your doorstep, consider the behind-the-scenes logistics acrobatics that help get them there so fast. Order fulfillment is a massive industry of moving parts. Heavily supported by autonomous mobile robots (AMRs), warehouses can span 1 million square feet, expanding and reconfiguring to meet demands. Read article > The post NVIDIA Unveils Isaac Nova Orin to Accelerate Development of Autonomous Mobile Robots appeared first on NVIDIA Blog.  ( 3 min )
    Driving on Air: Lucid Group Builds Intelligent EVs on NVIDIA DRIVE
    Lucid Group may be a newcomer to the electric vehicle market, but its entrance has been grand. The electric automaker announced at GTC that its current and future fleets are built on NVIDIA DRIVE Hyperion for programmable, intelligent capabilities. By developing on the scalable, software-defined platform, Lucid ensures its vehicles are always at the cutting Read article > The post Driving on Air: Lucid Group Builds Intelligent EVs on NVIDIA DRIVE appeared first on NVIDIA Blog.  ( 2 min )
    NVIDIA DRIVE Continues Industry Momentum With $11 Billion Pipeline as DRIVE Orin Enters Production
    NVIDIA DRIVE Hyperion and DRIVE Orin are gaining ground in the industry. At NVIDIA GTC, BYD, the world’s second-largest electric vehicle maker, announced it is building its next-generation fleets on the DRIVE Hyperion architecture. This platform, based on DRIVE Orin, is now in production, and powering a wide ecosystem of 25 EV makers building software-defined Read article > The post NVIDIA DRIVE Continues Industry Momentum With $11 Billion Pipeline as DRIVE Orin Enters Production appeared first on NVIDIA Blog.  ( 3 min )
    Announcing NVIDIA DRIVE Map: Scalable, Multi-Modal Mapping Engine Accelerates Deployment of Level 3 and Level 4 Autonomy
    With a detailed knowledge of the world and everything in it, maps provide the foresight AI uses to make advanced and safe driving decisions. At his GTC keynote, NVIDIA founder and CEO Jensen Huang introduced NVIDIA DRIVE Map, a multimodal mapping platform designed to enable the highest levels of autonomy while improving safety. It combines Read article > The post Announcing NVIDIA DRIVE Map: Scalable, Multi-Modal Mapping Engine Accelerates Deployment of Level 3 and Level 4 Autonomy appeared first on NVIDIA Blog.  ( 4 min )
    Introducing NVIDIA DRIVE Hyperion 9: Next-Generation Platform for Software-Defined Autonomous Vehicle Fleets
    NVIDIA DRIVE Hyperion is taking software-defined vehicle architectures to the next level. At his GTC keynote, NVIDIA founder and CEO Jensen Huang announced DRIVE Hyperion 9, the next generation of the open platform for automated and autonomous vehicles. The programmable architecture, slated for 2026 production vehicles, is built on multiple DRIVE Atlan computers to achieve Read article > The post Introducing NVIDIA DRIVE Hyperion 9: Next-Generation Platform for Software-Defined Autonomous Vehicle Fleets appeared first on NVIDIA Blog.  ( 3 min )
    Siemens Gamesa Taps NVIDIA Digital Twin Platform for Scientific Computing to Accelerate Clean Energy Transition
    Siemens Gamesa Renewable Energy is working with NVIDIA to create physics-informed digital twins of wind farms — groups of wind turbines used to produce electricity. The company has thousands of turbines around the globe that light up schools, homes, hospitals and factories with clean energy. In total they generate over 100 gigawatts of wind power, Read article > The post Siemens Gamesa Taps NVIDIA Digital Twin Platform for Scientific Computing to Accelerate Clean Energy Transition appeared first on NVIDIA Blog.  ( 3 min )
    NVIDIA Unveils Onramp to Hybrid Quantum Computing
    We’re working with leaders in quantum computing to build the tools developers will need to program tomorrow’s ultrahigh performance systems. Today’s high-performance computers are simulating quantum computing jobs at scale and with performance far beyond what’s possible on today’s smaller and error-prone quantum systems. In this way, classical HPC systems are helping quantum researchers chart Read article > The post NVIDIA Unveils Onramp to Hybrid Quantum Computing appeared first on NVIDIA Blog.  ( 3 min )
    Speed Dialer: How AT&T Rings Up New Opportunities With Data Science
    AT&T’s wireless network connects more than 100 million subscribers from the Aleutian Islands to the Florida Keys, spawning a big data sea. Abhay Dabholkar runs a research group that acts like a lighthouse on the lookout for the best tools to navigate it. “It’s fun, we get to play with new tools that can make Read article > The post Speed Dialer: How AT&T Rings Up New Opportunities With Data Science appeared first on NVIDIA Blog.  ( 3 min )
    NVIDIA Hopper GPU Architecture Accelerates Dynamic Programming Up to 40x Using New DPX Instructions
    The NVIDIA Hopper GPU architecture unveiled today at GTC will accelerate dynamic programming — a problem-solving technique used in algorithms for genomics, quantum  computing, route optimization and more — by up to 40x with new DPX instructions. An instruction set built into NVIDIA H100 GPUs, DPX will help developers write code to achieve speedups on Read article > The post NVIDIA Hopper GPU Architecture Accelerates Dynamic Programming Up to 40x Using New DPX Instructions appeared first on NVIDIA Blog.  ( 3 min )
    H100 Transformer Engine Supercharges AI Training, Delivering Up to 6x Higher Performance Without Losing Accuracy
    The largest AI models can require months to train on today’s computing platforms. That’s too slow for businesses. AI, high performance computing and data analytics are growing in complexity with some models, like large language ones, reaching trillions of parameters. The NVIDIA Hopper architecture is built from the ground up to accelerate these next-generation AI Read article > The post H100 Transformer Engine Supercharges AI Training, Delivering Up to 6x Higher Performance Without Losing Accuracy appeared first on NVIDIA Blog.  ( 4 min )
    NVIDIA Maxine Reinvents Real-Time Communication With AI
    Everyone wants to be heard. And with more people than ever in video calls or live streaming from their home offices, rich audio free from echo hiccups and background noises like barking dogs is key to better sounding online experiences. NVIDIA Maxine offers GPU-accelerated, AI-enabled software development kits to help developers build scalable, low-latency audio Read article > The post NVIDIA Maxine Reinvents Real-Time Communication With AI appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Microsoft Translator enhanced with Z-code Mixture of Experts models
    Translator, a Microsoft Azure Cognitive Service, is adopting Z-code Mixture of Experts models, a breakthrough AI technology that significantly improves the quality of production translation models. As a component of Microsoft’s larger XYZ-code initiative to combine AI models for text, vision, audio, and language, Z-code supports the creation of AI systems that can speak, see, […] The post Microsoft Translator enhanced with Z-code Mixture of Experts models appeared first on Microsoft Research.  ( 5 min )
    Powering the next generation of trustworthy AI in a confidential cloud using NVIDIA GPUs
    Cloud computing is powering a new age of data and AI by democratizing access to scalable compute, storage, and networking infrastructure and services. Thanks to the cloud, organizations can now collect data at an unprecedented scale and use it to train complex models and generate insights.   While this increasing demand for data has unlocked new […] The post Powering the next generation of trustworthy AI in a confidential cloud using NVIDIA GPUs appeared first on Microsoft Research.  ( 9 min )
  • Open

    Ranking System
    I am making a mobile app which contains newsfeed page, where I want to show the users post in ranking order, posts can be ranked based on the post content and the user rating count and feedback who has posted it, I don't want to spend more time on building a ranking system so what way or pathway I can take? Is there any API, system or anything already built which I can integrate into my app directly to rank post? submitted by /u/aliazlanaziz [link] [comments]  ( 1 min )
    NN AI vs projectile motion physics
    submitted by /u/hotcodist [link] [comments]
  • Open

    Multiple Frequency Shift Keying
    A few days ago I wrote about Frequency Shift Keying (FSK), a way to encode digital data in an analog signal using two frequencies. The extension to multiple frequencies is called, unsurprisingly, Multiple Frequency Shift Keying (MFSK). What is surprising is how MFSK sounds. When I first heard MFSK I immediately recognized it as an […] Multiple Frequency Shift Keying first appeared on John D. Cook.  ( 3 min )
    Dial tone and busy signal
    Henry Lowengard left a comment on my post Phone tones in musical notation mentioning dial tones and busy signals, so I looked these up. Tones According to this page, a dial tone in DTMF [1] is a chord of two sine waves at 350 Hz and 440 Hz. In musical notation: According to the same […] Dial tone and busy signal first appeared on John D. Cook.  ( 1 min )
  • Open

    Digit recogniser — Artificial Neural Network (ANN)
    Using the Artificial Neural Network (ANN) to make a churn model, we will create a model that predicts a handwritten digit. (with source…  ( 3 min )

  • Open

    A mini-conversation with Albert Einstein's AI persona
    submitted by /u/kuasha7 [link] [comments]
    Generate new designs based on provided dataset?
    I am looking to generate new car wheel designs based on a dataset I have of my own 30-50 wheels. I was wondering if there is a tool/GAN that would suit this? submitted by /u/jrobin51 [link] [comments]
    The advantages and disadvantages of AI in healthcare
    Artificial intelligence is generally praised for their use in many sectors. One of the most notable achievements of AI in science is its breakthrough in molecular biology as it helped reduce the time to identify the structure of complicated 3D protein folds. AI has also helped advance the efficiency in quantum computers, and now AI is making a large presence in clinical/healthcare. Despite the potential of AI being revolutionary in healthcare we must be cautious utilizing it. I want to highlight the advantages and disadvantages of AI to inspire and to set forth challenges for you to approach in the future. The advantage of AI in healthcare is its versatile applications. You can find developments in many topics of healthcare such as drug discovery, image processing, data analysis, and mor…
    Last Week in AI: AI discovers lethal molecules, Deepfake of Ukrainian President Volodymyr Zelenskyy, All-MLP AI model is fast and efficient, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    Grapes.
    submitted by /u/cookingandcraft [link] [comments]
    Coined in 2016, AIOps stands for Artificial Intelligence for IT Operations which combines big data and machine learning to automate IT operation processes. Watch this video to learn all about it.
    ​ https://reddit.com/link/tjfnnz/video/e1hw1trsiro81/player submitted by /u/Nitorblog [link] [comments]  ( 1 min )
    Artificial Intelligence and a Sense of Time
    Hi all, I've been interested in pattern theory for a bit, and as far as I can tell it can be viewed as a framework for artificial intelligence. In Grenander's book "Calculus of Thoughts" David Mumford mentions that a major issue with pattern theory is the creation of generators and modalities of bonds. For the bonds, has anyone tried basing their properties on a sense of time? I.e. in whatever environment a program is in, it would have a clock measuring time, and would somehow equate these bonds simply with events happening close to each other in time. Apologies if this is a naïve/vague/nonsensical question, I'm enthusiastic but not too knowledgeable about pattern theory. Any feedback is greatly appreciated submitted by /u/patterntheoryacc [link] [comments]  ( 1 min )
    is there some sort of Ai image generator that continues an image?
    I need an Ai image generator that takes an image and generates what is beyond the border of that image. for example if i have an image of the front half of a dog, it fills in the other half of the dog im going to use it to extend a picture of a building btw submitted by /u/Bingela_ [link] [comments]  ( 1 min )
    The case for human-centered AI
    submitted by /u/bendee983 [link] [comments]
    Weekly metaverse digest #3: virtual humans, HSBC gears for the metaverse, real estate of the future, and more
    submitted by /u/bent_out_of_shape_ [link] [comments]
    Which trends do you think will have the biggest impact on businesses in the upcoming years?
    View Poll submitted by /u/futureanalytica [link] [comments]  ( 1 min )
    Chemical X NUMBUH-841
    submitted by /u/VIRUS-AOTOXIN [link] [comments]
    Performance testing FastAPI ML APIs with Locust | Rubik's Code
    submitted by /u/RubiksCodeNMZ [link] [comments]
    Has anyone trained an AI to tell apart nudity in images, so it can be used to filter frames/scenes in or out of a video?
    I know the question sounds funny but I mean it seriously. Is there something trained like this, that could be easily plugged to something like FFmpeg? Doing it on every frame would probably be too slow, but one could probably do it only on P-frames with a very small increase of error rate. submitted by /u/flying-benedictus [link] [comments]  ( 1 min )
    Switchblade Drones join the fight: U.S.-made drones will be sent to Ukraine with the additional $800 million in military assistance that Joe Biden announced on Wednesday. At least 100 Switchblade Tactical Drones will be in the package.
    submitted by /u/dannylenwinn [link] [comments]  ( 1 min )
    LinkedIn Researchers Open-Source ‘FastTreeSHAP’: A Python Package That Enables An Efficient Interpretation of Tree-Based Machine Learning Models
    Researchers from LinkedIn open-source the FastTreeSHAP package which is a Python module based on the paper ‘Fast TreeSHAP: Accelerating SHAP Value Computation for Trees.’ Implementing the widely-used TreeSHAP algorithm in the SHAP package allows for the efficient interpretation of tree-based machine learning models by estimating sample-level feature significance values. Its package includes two new algorithms: FastTreeSHAP v1 and FastTreeSHAP v2, both of which improve TreeSHAP’s computational efficiency by taking a different approach. The empirical benchmarking tests show that FastTreeSHAP v1 is 1.5x faster than TreeSHAP while keeping memory costs the same, and FastTreeSHAP v2 is 2.5x faster while using slightly more memory. The FastTreeSHAP package fully supports parallel multi-core computing to speed up its computation. Continue Reading The Full Summary Article Paper: https://arxiv.org/pdf/2109.09847.pdf Github: https://github.com/linkedin/fasttreeshap submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    [D] How to compute the probability of trajectories term in Stochastic Gradient Meta Reinforcement Learning
    I asked this question on stats stackexchange, and it is posted here. I don't copy it here because formulas don't show up nicely on reddit. I am trying to implement this paper by Fallah et al (NIPs 2021) titled: On the Convergence Theory of Debiased Model-Agnostic Meta-Reinforcement Learning. As the title suggests, they propose an algorithm for meta RL that uses an stochastic approximation of the gradient. My problem is with the term that yields the probability of a given trajectory (a sequence of state-actions). I don't know how to estimate that term and the paper doesn't discuss that. I'd appreciate if anyone can share any insight on how to estimate that term. submitted by /u/carlml [link] [comments]  ( 1 min )
    [D][P] Neural Network as Frequency Offset Corrector
    Hi everyone! I'm working on a project where I'm using neural networks to overcome some RF impairments in a wireless communications system. When I introduce a frequency offset to the data, my current neural network (which is a simple 2 hidden layer feed-forward NN) can not compensate. The frequency offset is basically a phase rotation on the IQ samples. The problem is this is also time-dependent such that the amount the sample is shifted is a function of time. ie: x_out[n] = x[n]e^(j2*pi*w*n) where w is the phase shift from the frequency offset, n is the discrete sample. I know my NN needs some kind of memory (so I should use an LSTM or some RNN variant), just wondering if anyone has some recommended papers or has done some work similar to this and would be willing to help, thanks! submitted by /u/mtot10 [link] [comments]  ( 1 min )
    [P] A new way to start, stop and distribute code on cloud instances/pods - directly from any notebook, only when you need it
    Hi everyone! With a friend, we are develpoing a tool to be able to easily start pods, distribute the code (with magics) and stop them directly from your notebook. Pods can be started in seconds with the specifications you need: number of instances/pods, CPU, Memory, GPU and packages The tool makes it easy to increase computer resources quickly, code locally and consume resources only when you need it, paralelize tasks and do parameter grid search. If you like it, you can visit our landing page and register for early access: telekinesis.cloud We are giving 10 GPU hours to our first 50 registered users! We hope you like it! Thank you! Please share feedback submitted by /u/snuns90 [link] [comments]  ( 1 min )
    [R] EvoRL @ GECCO 2022: 2nd Evolutionary RL workshop @ GECCO 2022
    CALL FOR PAPERSEvoRL 2022Evolutionary Reinforcement Learning workshop at GECCO 2022, July 9-13, Boston, USA In recent years reinforcement learning (RL) has received a lot of attention thanks to its performance and ability to address complex tasks. At the same time, multiple recent papers, notably work from OpenAI, have shown that evolution strategies (ES) can be competitive with standard RL algorithms on some problems while being simpler and more scalable. Similar results were obtained by researchers from Uber, this time using a gradient-free genetic algorithm (GA) to train deep neural networks on complex control tasks. Moreover, recent research in the field of evolutionary algorithms (EA) has led to the development of algorithms like Novelty Search and Quality Diversity, capable of eff…  ( 2 min )
    [P] Identifying someone's professional background through his syntax.
    Hi, As the title points out, I would like to identify someone's professional background (e.g. engineering, art, ...) through his syntax. To do so my strategy was to find a dataset regrouping messages alongside the person's background but it doesn't exist (to my knowledge) Do you guys have any idea to persue? (any good dataset?, trying to find multiple datasets and merging them?, scraping LinkedIn?) Is it even possible to e.g. scrape LinkedIn for such information? Any thoughts and feedback are welcome! Initial strategy: - Either finding a dataset or scraping LinkedIn. (is it even possible?) - Training a NN. Alternative question: - Do you know of any other approach that would be less data intensive using for instance hand crafted features, computational linguistics and the such? Cheers, submitted by /u/Inquation [link] [comments]  ( 2 min )
    [R][P] AdaFamily: A family of Adam-like adaptive gradient methods
    submitted by /u/HannesF99 [link] [comments]  ( 1 min )
    [R] Monotonic Differentiable Sorting Networks w/ animated video (ICLR 2022)
    I have made an animated video (https://www.youtube.com/watch?v=Rl-sFaE1z4M) for our ICLR 2022 paper (https://arxiv.org/abs/2203.09630). Check it out if you are interested. I have made the video using 3b1b's manim library (https://github.com/ManimCommunity/manim). Feedback is always very welcome! submitted by /u/Human-Career-9962 [link] [comments]  ( 1 min )
    [N] LinkedIn open sources FastTreeSHAP Python package for interpretation of tree-based ML models
    LinkedIn open sources the FastTreeSHAP Python package for efficient interpretation of tree-based ML models (XGBoost, LightGBM, sklearn random forest) using SHAPLEY. FastTreeSHAP v2 would be 2.5x faster than TreeSHAP. Let's reminder that SHAP (SHapley Additive exPlanation) values quantify the contribution of each feature to the model prediction, a bit like how each player contributes to the success of a sports team. SHAP does it by incorporating concepts from game theory and local explanations. Naively implemented, SHAP takes exponential time. LinkedIn blog post, scientific paper, and GitHub repo with IPython Notebooks. submitted by /u/ClaudeCoulombe [link] [comments]  ( 1 min )
    [D] Using ML Interpretability Techniques for Data Analysis
    Hey, all. Hope you lot are doing alright. I was looking into Explainable AI and model interpretability lately, and I had an idea but am wondering whether it would constitute a valid use case. There is a data analysis project happening at work where we're trying to analyze data we had on hand to determine the factors that affect our KPOs and possibly derive useful actionable insights. Instead of moving forward with manually evaluating correlation and doing EDA that way, I was thinking of using variable importance measures, subpopulation analyses, and partial dependence profiles to potentially gain insights, or at the very least narrow down exploratory scope. I have been scrounging the internet for someone using white box models for expediting analysis tasks, but I haven't found anything that remotely fits the bill except perhaps this article: https://medium.com/swlh/how-i-used-a-machine-learning-model-to-generate-actionable-insights-3aa1dfe2ddfd It'd be great if you guys let me know what you think. Cheers, submitted by /u/Impartial_Bystander [link] [comments]  ( 1 min )
    [D] Why is the predictive variance of Bayesian models can be interpreted as epistemic uncertainty?
    Is there a direct link or explanation on why the predictive variance of a Bayesian model can be interpreted as epistemic uncertainty? Many epistemic uncertainty quantification methods, for example, MC-Dropout, Deep Ensemble, claim themselves as a Bayesian model to justify their approach for quantifying uncertainty. Let's just say their claims are valid so that the methods are truly Bayesian. But then, why are Bayesian methods supposed to estimate uncertainty well? submitted by /u/swyoon_ [link] [comments]  ( 1 min )
    [D] Dealing with normalisation of data containing extreme values
    Hi everyone, Imagine there’s a rainfall dataset, and there’s surely extreme values in the dataset. These extreme values are important because I would need them to predict extreme events. Unfortunately, when doing normalisation, these extreme values will tend to result in almost every other values being extremely small after min-max. Any suggestions to go about this? This is for research submitted by /u/plsendfast [link] [comments]  ( 2 min )
    [D] Equivariant Subgraph Aggregation Networks
    An author interview on the Equivariant Subgraph Aggregation Networks paper. Discusses why the expressive power of GNNs is limited and a method for breaking the bottleneck of the 1-WL algorithm https://youtu.be/VYZog7kbXks submitted by /u/zjost85 [link] [comments]
  • Open

    Recommended materials for beginners in theoretical RL
    submitted by /u/Rich_Beautiful [link] [comments]  ( 1 min )
    How initialize variable in RL to reproduce results
    I am using stabel-baselines frameworks to train the model. To build an agent environment I am using Gym environment. I want to make sure that results do not change very much when I run the same model a second time. How is it possible? submitted by /u/Mariam_Dundua [link] [comments]  ( 1 min )
    League of Legends A New Dawn Trailer (Resolution increased with the help of neural networks up to 8K 60FPS)
    submitted by /u/stepanmetior [link] [comments]  ( 1 min )
    [CfP] EvoRL @ GECCO 2022: 2nd Evolutionary RL workshop @ GECCO 2022
    CALL FOR PAPERS EvoRL 2022 Evolutionary Reinforcement Learning workshop at GECCO 2022, July 9-13, Boston, USA In recent years reinforcement learning (RL) has received a lot of attention thanks to its performance and ability to address complex tasks. At the same time, multiple recent papers, notably work from OpenAI, have shown that evolution strategies (ES) can be competitive with standard RL algorithms on some problems while being simpler and more scalable. Similar results were obtained by researchers from Uber, this time using a gradient-free genetic algorithm (GA) to train deep neural networks on complex control tasks. Moreover, recent research in the field of evolutionary algorithms (EA) has led to the development of algorithms like Novelty Search and Quality Diversity, capable of…  ( 2 min )
    What are some real life examples of dynamics in reinforcement learning?
    Dynamics in reinforcement learning, that are represented by the transition function in an MDP, are meant to modelize the probability of reaching (or deriving from) the desired state. From what I understand, this probability is caused by the environement or by a malfunction within the agent(?). I would like to ask if there is any real life problem modelized into an RL model with a real life transition probability? I would be super grateful if you redirect me to research papers in this axis (all I found so far are theoretical representation of dynamics) Thanks in advance. submitted by /u/Unfinished-plans [link] [comments]  ( 3 min )
    "SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning", Park et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy", Yuanhan et al 2022 {Sensetime} (69m categorized images)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Modern Hopfield Networks for Return Decomposition for Delayed Rewards", Widrich et al 2021
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    Solar declination
    This post expands on a small part of the post Demystifying the Analemma by M. Tirado. Apparent solar declination given δ by δ = sin-1( sin(ε) sin(θ) ) where ε is axial tilt and θ is the angular position of a planet. See Tirado’s post for details. Here I want to unpack a couple things […] Solar declination first appeared on John D. Cook.  ( 1 min )
  • Open

    League of Legends A New Dawn Trailer (Resolution increased with the help of neural networks up to 8K 60FPS)
    submitted by /u/stepanmetior [link] [comments]  ( 1 min )
    Link to free resources are mentioned in the description of the video!!
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
    Performance testing FastAPI ML APIs with Locust | Rubik's Code
    submitted by /u/RubiksCodeNMZ [link] [comments]
  • Open

    Artificial Intelligence Applications for 2022
    The field of Artificial Intelligence (AI) continues to expand and improve by leaps and bounds. Today’s AI applications are becoming smarter…  ( 3 min )
  • Open

    Accelerating Ukraine Intelligence Analysis with Computer Vision on Synthetic Aperture Radar Imagery
    Figure 1: Airmass measurements over Ukraine from February 18, 2022 - March 01, 2022 from the SEVIRI instrument. Data accessed via the EUMETSAT Viewer. Satellite imagery is a critical source of information during the current invasion of Ukraine. Military strategists, journalists, and researchers use this imagery to make decisions, unveil violations of international agreements, and inform the public of the stark realities of war. With Ukraine experiencing a large amount of cloud cover and attacks often occuring during night-time, many forms of satellite imagery are hindered from seeing the ground. Synthetic aperture radar imagery penetrates cloud cover, but requires special training to interpret. Automating this tedious task would enable real-time insights, but current computer vision meth…  ( 11 min )

  • Open

    [R][P] StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis + Gradio Web Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 1 min )
    [P] Problems with rainfall prediction with neural network.
    So I am trying to make a neural network which would predict if it's raining or not. However I'm having problems with the neural network because it seems to not learn at all. Or at least it gives very low propability for any data which I try to predict. The dataset contains temperature, pressure, and relative humidity which are mesaured every 10 minutes. The data is from Finnish Weather forecast center (Ilmatieteenlaitos). I'm also modifying the data so that I can feed 24h of values for each rainfall value in to the neural network. Here is a photo explaining this better https://imgur.com/a/jFpsN8Q and here is the code for the project https://www.kaggle.com/code/juhosyvjrvi/rainfall-prediction/notebook Thank you for any help in advance! submitted by /u/Juuhis999 [link] [comments]  ( 1 min )
    [R] Explaining the Explainable AI: A 2-Stage Approach - Link to a free online lecture by the author in comments
    submitted by /u/pinter69 [link] [comments]  ( 2 min )
    [D] code to visualize attention heads
    I was wondering if anyone has any code they use in python to visualize attention head maps for transformer architectures. If so are you willing to share it? submitted by /u/AbjectDrink3276 [link] [comments]  ( 1 min )
    [D] Does RepL4NLP accept papers such as pruning and quantization?
    Does RepL4NLP accept papers such as pruning and quantization? The link below gives you a list of topics if you scroll down. One of them was " Efficient learning of representations and inference: with respect to training and inference time, model size, amount of training data, etc.". I was wondering if that has anything to do with pruning and/or quantization? ​ https://sites.google.com/view/repl4nlp2022/ submitted by /u/SiegeMemeLord [link] [comments]  ( 1 min )
    [Discussion] / [Reaearch]: Anomaly detection in massive datasets (70-100 TerraBytes)
    Hi everyone. I am looking for reaearch papers and ideas on how to solve this problem for my master’s thesis. Any pointers to SOTA reearch would be greatly appreciated. Let me just quickly summarize the problem at hand: There are over 100k devices connected to this cloud network, where the devices track the GPS location of shipping containers. Most of the containers also have a cooling system, where it’s possible to read a couple of sensor data and also remotely control temperature/humidity in the containers. My main objective is to apply some ML algo to try and predict if some device (or the integrated cooling system) is about become faulty, such that it can trigger an alarm and consequently get someone to inspect the devices due to possible malfunction. The approach: Just on top of my mind, I would like to apply some unsupervised learning technique to detect if some device is behaving far from the norm. I am even more intetested in supervised learning, but it might be a little tricky to tell based on a record if it’s faulty (target variable) Any ideas? Feel free to ask if something isn’t clear. Training a model on 100TB of data might become a challenge in itself. I would have to apply some algo that can learn iteratively. Thanks! submitted by /u/shotez [link] [comments]  ( 2 min )
    [D] Papers about 'true' utilization of pruning
    Good day, my fellow researchers and engineers. Most of the unstructured pruning algorithms cannot induce their theoretical performance due to the fact that we cannot ignore the zeros when the actual computation occurs. I recently read the paper about N:M structured sparse neural networks and I just got interested in the techniques and research about 'the actual acceleration for zeroes'. If we want to dig more about these kinds of sparsity acceleration ( If I'm calling it right), which paper should we look into? How can we ignore the zeroes, aside from saving indices of non-zero values? submitted by /u/KindAd9527 [link] [comments]  ( 1 min )
    [P] Find and fix bugs in ML code
    Hey everyone, it's notoriously hard to find bugs in ML models because often times they are silent. I've created a VSCode extension that can scan your Python functions and find bugs. Under the hood it utilizes a ML model that tries to understand your intent and provide suggestions. Please, could you give me some feedback? I would really appreciate any help. Thanks a lot in advance! Website: https://tensorbox.ai Video demo: https://youtu.be/MSpsCpPZokY submitted by /u/CosmicRaysGalaxy [link] [comments]  ( 1 min )
  • Open

    Google Colab to a Ploomber pipeline: ML at scale
    In this short blog, we’ll review the process of taking a POC data science pipeline (ML/Deep learning/NLP) that was conducted on Google Colab, and transforming it into a pipeline that can run parallel at scale and works with Git so the team can collaborate on. Background Google Colab is pretty straightforward, you can open a… Read More »Google Colab to a Ploomber pipeline: ML at scale The post Google Colab to a Ploomber pipeline: ML at scale appeared first on Data Science Central.  ( 3 min )
    Biases in CLIP and the Stanford HAI report
    The Stanford HAI (Human cantered artificial intelligence) report is out. I track this report every year and it always has some good insights. The report is a bit more focused on large language models and their impact – but also covers key trends The main findings are Private investment in AI soared while investment concentration… Read More »Biases in CLIP and the Stanford HAI report The post Biases in CLIP and the Stanford HAI report appeared first on Data Science Central.  ( 2 min )
    How to Modernize Enterprise Data and Analytics Platform (Part 1 of 4)
    Data and Analytics Platform is a sub-platform of the Enterprise Digital Business Technology Platform. It contains information management and analytical capabilities. Introduction Building the digital business technology platform is a core aspect of enterprise endeavors to support its digital business transformation and therefore gain a sustainable competitive advantage. The digital business technology platform includes 5… Read More »How to Modernize Enterprise Data and Analytics Platform (Part 1 of 4) The post How to Modernize Enterprise Data and Analytics Platform (Part 1 of 4) appeared first on Data Science Central.  ( 6 min )
  • Open

    Reverse engineering Fourier conventions
    The most annoying thing about Fourier analysis is that there are many slightly different definitions of the Fourier transform. One time I got sufficiently annoyed that I creates a sort of Rosetta Stone of Fourier theorems under eight different conventions. Later I discovered that Mathematica supports these same eight definitions, but with slightly different notation. […] Reverse engineering Fourier conventions first appeared on John D. Cook.  ( 3 min )
  • Open

    [D] Should I go for the PhD?
    Like many others on here, I've hit a crossroad in life and I'm unsure whether I want to do a PhD or stay in the industry, so I am looking for some advice on my particular situation. About me: Interested in creating intelligent robots. As cringy as it sounds, I do believe that star-wars-like intelligent robots are possible within my lifetime and my goal is to contribute towards that I love love love creating novel breakthrough technology that has the potential to improve everyone's lives. I like writing papers and creating tech in equal amounts. I love seeing the impact of my work firsthand. At this point in time, I don't think I'll enjoy an academic career because I like doing things with a more immediate impact. Thus I think I'd like to end up in industry research. Graduated MSc Rob…  ( 6 min )
    drone environment ?
    Hi all. I need to implement a drone env to train neural network Capable of stabilizing a drone after throwing it. any suggestions for pre built envs or where to find informations on what i should consider if i want to build one on my own? I know how to use pybullet and the open ai gym interface so building one is not out of the question but a pre built one by a more experienced people would be better given the fact that I'm on tight schedule Sorry for my English not a native speaker :) submitted by /u/HerForFun998 [link] [comments]  ( 1 min )
    Question about unsupervised reinforcement learning finetune process
    Hi, I'm trying to implement unsupervised RL algorithms, like vision-based APT. I trained my agent and saw some meaningful behaviors in pre-training session. but, at finetune session, the agent showed bad adjustment. I guess over-estimation was occurred. (The experiment was action repeat 8 so it looks worse than official drq implementation.) Orange : scratch DrQ, gray : Finetuned version after APT 2M frames training Orange : scratch DrQ, gray : Finetuned version after APT 2M frames training How can I finetune that I had already pre-trained agent well? ​ Thanks for reading. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 1 min )
  • Open

    MIT Researchers Developed A Machine-Learning Model To Generate Extremely Realistic Synthetic Data That Can Train Another Model For Downstream Vision Tasks
    Many machine learning tasks require high-quality data, such as assessing damage in a satellite that negatively affects a model’s performance. Datasets can cost millions of dollars to create if useable data exists, and even the best datasets sometimes contain biases that negatively impact a model’s performance. Many scientists have been working to answer an intriguing question working with synthetic data sampled from a generative model instead of real data. A generative model is a machine-learning model that requires significantly less memory to keep or share than a dataset. The range and quality of generative models have improved dramatically in recent years. Synthetic data has the ability to get around some of the privacy and usage rights problems that limit how actual data may be distributed. A generative model could potentially be updated to eliminate particular attributes, such as race or gender, to overcome biases in traditional datasets. New research by MIT Team develops a method for training a machine learning (ML) model that, rather than requiring a dataset, employs a particular form of ML model to generate exceptionally realistic synthetic data that can train another model for downstream vision tasks. Continue Reading Paper: https://openreview.net/pdf?id=qhAeZjs7dCL Github: https://github.com/ali-design/GenRep Project: https://ali-design.github.io/GenRep/ submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Ai for videos characters
    Is there any good ai for automating virtual characters in movies and games? submitted by /u/Ok-Suspect-9855 [link] [comments]
    Smooth Complex 3D Scenes from a Couple of Images!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    Can AI help when it's "more of an art than a science"?
    submitted by /u/Representative-Job23 [link] [comments]
    Flowers.
    submitted by /u/cookingandcraft [link] [comments]
    desconected
    Some one knows If there's a way for not geting disconected using the S2ML generator on Google colab? submitted by /u/Pale-Information3077 [link] [comments]
  • Open

    Building An Effective Experimentation Program – 01 Introduction
    Seamless. Frictionless. Elegant. Efficient. Read More The post Building An Effective Experimentation Program – 01 Introduction appeared first on ML in Production.  ( 4 min )
  • Open

    Building An Effective Experimentation Program – 01 Introduction
    Seamless. Frictionless. Elegant. Efficient. Read More The post Building An Effective Experimentation Program – 01 Introduction appeared first on ML in Production.  ( 4 min )
  • Open

    Constrained optimization with newton's method for real-life applications
    Hi! Does anyone know of any real-life applications of constrained optimization using newton's method in deep learning (specifically in minimizing the cost function)? Or if constrained optimization is even used at all when minimizing the cost function? I know that Newton's method can be used to find the minimum of the cost function in logistic regression. However, would there be any instances where we would add constraints to the cost function in logistic regression, therefore allowing us to use constrained optimization techniques with newton's method (such as Lagrange multipliers or interior-point methods). submitted by /u/No_Pop3856 [link] [comments]  ( 1 min )
  • Open

    Improving Human Sperm Head Morphology Classification with Unsupervised Anatomical Feature Distillation. (arXiv:2202.07191v3 [cs.CV] UPDATED)
    With rising male infertility, sperm head morphology classification becomes critical for accurate and timely clinical diagnosis. Recent deep learning (DL) morphology analysis methods achieve promising benchmark results, but leave performance and robustness on the table by relying on limited and possibly noisy class labels. To address this, we introduce a new DL training framework that leverages anatomical and image priors from human sperm microscopy crops to extract useful features without additional labeling cost. Our core idea is to distill sperm head information with reliably-generated pseudo-masks and unsupervised spatial prediction tasks. The predicted foreground masks from this distillation step are then leveraged to regularize and reduce image and label noise in the tuning stage. We evaluate our new approach on two public sperm datasets and achieve state-of-the-art performances (e.g. 65.9% SCIAN accuracy and 96.5% HuSHeM accuracy).  ( 2 min )

  • Open

    [Discussion] Modelling Explicit and Implicit knowledge in RL
    Hi all, I recently read the YOLOR (You Only Learn One Representation) paper ( https://arxiv.org/abs/2105.04206 ). One of the key takeaways from this paper is that modelling implicit knowledge is crucial. And in order to make the model learn this implicit knowledge, simply "relaxing" the error term in cross-task learning is not enough, because then conventional loss functions won't be able to capture this implicit knowledge. So in order to tackle this problem, they model the error term for various tasks such that implicit knowledge can be learned. This reminded me very much of meta-RL, except I haven't seen any implementations of it where they explicitly model this per-task error and use a similar implicit+explicit knowledge learning framework. Since YOLOR is currently SOTA for realtime object detection and some other tasks, it seems like this was quite a significant insight and I wonder if anyone is aware of similar works in the "RL branch" of AI? I'd love to hear some thoughts about this or if someone can point me to related literature this would be great. In case this hasn't been looked into before, I'd love to start a PhD regarding this topic to learn more about it haha submitted by /u/Chronicle112 [link] [comments]  ( 1 min )
    [D] Very frustrated with the state of peer-reviewing in math/computational non-ML journals when it comes to ML-based approaches
    A bit of a rant here. I received a "major revision" recommendation for a paper I submitted a few months ago (was self-contained, no ML background was necessary). The field is mathematical and the journal also addresses numerical & computational aspects of it, but applying ML (especially neural net based L2 regressions) is still a bit unheard of there. As such, I made sure to dumb everything down for them and still gave the mathematical justifications, including convergence proofs. In short, it was about estimating certain conditional expectations having a very special structure, without resorting to heavy brute-force approaches (like nested Monte-Carlo), using L2 projections with neural networks and a special sampling scheme which was also explained in detail. In such an approach neural n…  ( 3 min )
    [R][P] MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning + Hugging Face Gradio Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Complete Guide of Swin Transformer with Full PyTorch Implementation
    Hello everyone! I’ve recently read Swin Transformer paper and tried to implement with PyTorch. But there’re no post that FULLY explains the nitty-gritty details of the paper with full implementation. It took me soooo long time to write this post so I wanted to share with y’all! Hope this helps someone! The implementation is based on the official implementation of Microsoft team. https://jasonlee-cp.github.io/paper/Swin_Transformer/#swin-transformer-architecture submitted by /u/JasonTheCoders [link] [comments]  ( 1 min )
    [D] Conferences to publish ML tooling?
    Hey, I write a lot of ML tooling and am hoping to present some of it in conferences somewhere. The main goal of publishing somewhere is to have a work that can be cited and is peer reviewed, but the real focus is on the package. Some people are weird about citing GitHub packages. Are there any specialized conferences for this sort of thing? Or do you just try your luck at a more niche conference? submitted by /u/puppet_pals [link] [comments]  ( 1 min )
    [P] Reading time of analog clocks
    https://github.com/akucia/analog-watch-recognition ​ https://preview.redd.it/4xrti7icgdo81.jpg?width=1200&format=pjpg&auto=webp&s=b76fd4aee01ef2eda4c9bb1c739d10ff38fe3cab submitted by /u/kuciu [link] [comments]
    [R] Code + Data for Paper (AAAI 2022): Multiple-Source Domain Adaptation via Coordinated Domain Encoders and Paired Classifiers
    You can find the paper here: https://arxiv.org/abs/2201.11870 And the code and the data here: https://github.com/p-karisani/CEPC submitted by /u/payam_ka [link] [comments]
    Ensemble Learning with Random Forest and Dense Neural Network. [R]
    I am working on a hospital dataset, My Y variable is the total cost of the diagnosis. I wanna predict that so it's a regression problem, Ultimately. I used random forest and Dense NN Separately and I got pretty decent scores but is it possible to ensemble random forest and neural network. submitted by /u/nibras_28 [link] [comments]  ( 1 min )
    [D] Physics and Reinforcement Learning - Discussion of Deepmind's work
    I chatted with a few experts about Deepmind's recent work on using RL for Nuclear Fusion and Plasma Control. We collected some resources and took some notes to understand the work better. Thought to share them here: [video] Physics and Reinforcement Learning What is fusion? What is plasma? What is a Tokamak? Why do we care about conrolling plasma? The RL task is a continuous control task. That means that the inputs, “controller” take continuous, real values the same as would result from sliding a slider or turning a dial. Other examples of continuous control tasks are There are 19 of these metaphorical “dials” controlling coils in the tokamak. The task of the RL agent, then, is to learn how best to control those 19 “dials” given observations from the sensors within the tokamak.…  ( 5 min )
    [Discussion] What are some papers you read which helped you build an intuition of how neural networks function?
    I started working through this paper on how neural networks divide and map their input space. It helped me visualize what happens in the hidden layers and how they help in better function approximation. I'm looking for some suggestions on papers to read which help me build a stronger intuition on how and why neural nets work. submitted by /u/theanswerisnt42 [link] [comments]  ( 2 min )
    [P] DeepForSpeed: A self driving car in Need For Speed Most Wanted with just a single ConvNet to play ( inspired by nvidia )
    submitted by /u/toxickettle [link] [comments]  ( 3 min )
    [D] Applications for using reinforcement learning to fine-tune GPT-2
    I'm currently writing my Master's thesis on the merits and pitfalls of using reinforcement learning, specifically Proximal Policy Optimization, to finetune large language models. The project is inspired by the paper Fine-Tuning Language Models from Human Preferences, which uses a reward model trained on a dataset of human preferences to supply a reward signal used for fine-tuning GPT-2 to generate summaries and continuations that are optimized for human preference. I'm investigating the pros and cons of a more naive approach that does not require collecting a dataset of human preferences. Using the trl library, I train a BERT-classifier to distinguish between sarcastic and non-sarcastic reddit comments, and that classifier then serves as a reward model that provides a reward signal for fine-tuning GPT-2 for text generation using PPO. I have applied the same method to the task of generating negative review, by training BERT on the IMDB-dataset. This method of course leads to extensive reward hacking, but investigating how to mitigate that is part of the fun! I'm looking for inspiration for what other NLP tasks I could apply this to. The important constraint is that the tasks have to be something that I can train a classifier to evaluate, or otherwise evaluate algorithmically using metrics like ROUGE, because I need to supply GPT-2 with a scalar reward signal. What other tasks should I try this method out on? submitted by /u/Sisyfos42 [link] [comments]  ( 1 min )
    [D] Extensions to U-nets
    I've been working with U-nets, and I’m curious about the possible extensions to improve the performance of U-nets. I found several so far: Variational U-nets: https://proceedings.neurips.cc/paper/2018/file/473447ac58e1cd7e96172575f48dca3b-Paper.pdf Unet++: https://arxiv.org/pdf/1912.05074.pdf Stacked Dilated U-nets: https://arxiv.org/pdf/2004.03466.pdf There seems to be more ways to expand on the traditional U-net architecture that I've yet to discover. I'd love to see other cool methods! submitted by /u/2133 [link] [comments]  ( 1 min )
    [R][P] StyleSwin: Transformer-based GAN for High-resolution Image Generation + Hugging Face Gradio Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 1 min )
    [P] Automatically Calibrate RGBD Cameras with PyTorch
    submitted by /u/covertBehavior [link] [comments]  ( 2 min )
    [P] Non-discrete sample "classification"
    I'm a biologist, and I have a problem with a very common analysis done in my field. We often classify cells by unique profile proteins they express. Cells that are high in protein A but low in B may be called "Type 1", cells low in protein A but high in protein B may be called "Type 2", cells high in both A and B "Type 3", and cells low in both "Type 4". Sometimes this works well and cells are clearly one type or the other. But unfortunately nature doesn't care about our desire to neatly classify things, and I believe that cell identity exists on a spectrum. Protein expression isn't all or nothing, it's effectively a continuous variable. There are cases where some cells are probably actually "Type 1", some are actually "Type 2", but some meaningfully exist as "somewhere between Type 1 and…  ( 4 min )
    [Discussion] Often-times data scientists work in silos. And later on in their project, they realize that “I wish I had known this ….”
    Often-times data scientists work in silos. And later on in their project, they realize that “I wish I had known this ….” Some of the questions that arise are: ​ [Data] Training data is not representative of Production Data [Economics] Business had different success criteria than data scientist’s evaluation metrics [Engineering] Do not have an infrastructure in place to support models [Engineering] Product backend is written in Java and your pipeline is in Python [Legal] Need to take approval from Legal and Governance to consider data privacy, bias, and fairness [Stakeholders] Explainability is more important than results Have you come across any such situations? submitted by /u/rajg88 [link] [comments]  ( 1 min )
  • Open

    "Agile Locomotion via Model-free Learning", Margolis et al 2022
    submitted by /u/gwern [link] [comments]
    DDPG - Neural Network output does not depend on the input
    Hi all! I am trying to use RL in order to train an agent that provides best direction to move in order to maximize the coverage of an area and I am giving reward on the basis of coverage increase at each step. I am implementing such algorithm both for 1 agent (so I am using simple DDPG) and for a set of agents (using MADDPG), but if I make the exploration noise decay, I noticed that the selected action of the model is always the same, independently on the input it receives. In fact, even during training, selected action is always a little update with respect to the one selected previously, even if the episode changes, and the inputs too. I have tried increasing/decresing the effect of noise (I am using Ornstein Uhlenbeck process) or decreasing the learning rates, but this seems to have no effect. What may be the problem of such a strange behaviour? submitted by /u/cosimobr [link] [comments]  ( 1 min )
    Advice on network size
    I'm making DQN for automatic trading. Due to the nature of the problem, I figured the model would possibly need to be quite complex, but actually I have no idea just how big it should be. I am feeding it with 8000 numbers representing the last prices. Due to some reasons still unknown to me, I can basically add as many layers as I want as long as they are not bigger than 4096 neurons without increasing learning speed. I just wanted to ask: would 20 layers be too many ? Would 5 probably not be enough ? Right now the model looks like that: Dense(2048, "relu") Dense(1024, "relu") Dense(512, "relu") Dense(256, "relu") Dense(128, "relu") Dense(1024, "relu") Dense(2048, "relu") Dense(1024, "relu") Dense(512, "relu") Dense(256, "relu") Dense(128, "relu") Dense(64, "relu") Dense(32, "relu") Dense(16, "relu") Dense(8, "relu") Dense(3, "relu") Any help is appreciated, really I just want to have an idea of how much would be too much or too little for this specific problem, I'm a total beginner. submitted by /u/superpowers_fan [link] [comments]  ( 2 min )
  • Open

    Machine learning and phone data can improve targeting of humanitarian aid
    submitted by /u/Jazmineco [link] [comments]
    Artificial Nightmares: Burning Teldrassil || Clip Guided Disco Diffusion AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    I am trying to use VQGAN+CLIP, but i can't make It work, some one more experienced knows how to fix It?
    submitted by /u/Pale-Information3077 [link] [comments]
    Ghibli Howl’s Moving Castle 11x16” acrylic painting for sale! $180 plus free shipping! Message me if you’re interested.
    submitted by /u/the_easel_art [link] [comments]
    How close are we to an context understanding AI?
    I am layman and just curious about the topic, how far are we to an AI which can understand the question and formulate a accurate and precise answer and not just return us the web results. submitted by /u/curiosityVeil [link] [comments]  ( 1 min )
    AI
    I love technology but I do not like the increasing development of AI. I believe that AI has more dangers than positive impacts on humanity. submitted by /u/joginderMike [link] [comments]  ( 1 min )
    The degradation of war
    submitted by /u/Hacknaut [link] [comments]
    An AI painting some colorful pitbulls
    submitted by /u/notrealAI [link] [comments]  ( 1 min )
  • Open

    Is Data Scientist Weak Link in Data-drive Value Creation?
    I attended an in-person customer event sponsored by Dataiku last week.  Very fun.  Man, do I miss the provocative and enlightening discussions that occur in these face-to-face customer engagements.  The icebreaker for fueling the dinner discussions was the following statement: “In the marketplace, dynamics in the job marketplace will evolve, and data-savvy subject matter experts… Read More »Is Data Scientist Weak Link in Data-drive Value Creation? The post Is Data Scientist Weak Link in Data-drive Value Creation? appeared first on Data Science Central.  ( 7 min )
  • Open

    AI composer> I have forced the model to generate a piece just by using dyads (a 2-note chord)! The challenge is to find notes that match and create beautiful sound. That is why I have named this piece: Love
    submitted by /u/amin_mlm [link] [comments]  ( 1 min )
    I can't debug my dang code
    Hi everyone. I'm trying to debug my code and I can't manage to find what's wrong. I'm running a Roberta model with a classifier head on a Pytorch Lightning trainer and I just get stuck after it is running the optimizer config, it's loading the train data, calling __len__(from dataset) a couple of times, loading val data, calling len again a few times and then complete freeze, what could be going on? Also it's 100% running on the gpu but python is not using any gpu in task manager so I don't think it's actually doing anything. submitted by /u/DifferentMedia2536 [link] [comments]  ( 1 min )
  • Open

    Threat Detection for General Social Engineering Attack Using Machine Learning Techniques. (arXiv:2203.07933v2 [cs.CR] UPDATED)
    This paper explores the threat detection for general Social Engineering (SE) attack using Machine Learning (ML) techniques, rather than focusing on or limited to a specific SE attack type, e.g. email phishing. Firstly, this paper processes and obtains more SE threat data from the previous Knowledge Graph (KG), and then extracts different threat features and generates new datasets corresponding with three different feature combinations. Finally, 9 types of ML models are created and trained using the three datasets, respectively, and their performance are compared and analyzed with 27 threat detectors and 270 times of experiments. The experimental results and analyses show that: 1) the ML techniques are feasible in detecting general SE attacks and some ML models are quite effective; ML-based SE threat detection is complementary with KG-based approaches; 2) the generated datasets are usable and the SE domain ontology proposed in previous work can dissect SE attacks and deliver the SE threat features, allowing it to be used as a data model for future research. Besides, more conclusions and analyses about the characteristics of different ML detectors and the datasets are discussed.  ( 2 min )
    AI Autonomy: Self-Initiation, Adaptation and Continual Learning. (arXiv:2203.08994v1 [cs.AI])
    As more and more AI agents are used in practice, it is time to think about how to make these agents fully autonomous so that they can (1) learn by themselves continually in a self-motivated and self-initiated manner rather than being retrained offline periodically on the initiation of human engineers and (2) accommodate or adapt to unexpected or novel circumstances. As the real-world is an open environment that is full of unknowns or novelties, detecting novelties, characterizing them, accommodating or adapting to them, and gathering ground-truth training data and incrementally learning the unknowns/novelties are critical to making the AI agent more and more knowledgeable and powerful over time. The key challenge is how to automate the process so that it is carried out continually on the agent's own initiative and through its own interactions with humans, other agents and the environment just like human on-the-job learning. This paper proposes a framework (called SOLA) for this learning paradigm to promote the research of building autonomous and continual learning enabled AI agents. To show feasibility, an implemented agent is also described.  ( 2 min )
    Bayesian Framework for Gradient Leakage. (arXiv:2111.04706v2 [cs.LG] UPDATED)
    Federated learning is an established method for training machine learning models without sharing training data. However, recent work has shown that it cannot guarantee data privacy as shared gradients can still leak sensitive information. To formalize the problem of gradient leakage, we propose a theoretical framework that enables, for the first time, analysis of the Bayes optimal adversary phrased as an optimization problem. We demonstrate that existing leakage attacks can be seen as approximations of this optimal adversary with different assumptions on the probability distributions of the input data and gradients. Our experiments confirm the effectiveness of the Bayes optimal adversary when it has knowledge of the underlying distribution. Further, our experimental evaluation shows that several existing heuristic defenses are not effective against stronger attacks, especially early in the training process. Thus, our findings indicate that the construction of more effective defenses and their evaluation remains an open problem.  ( 2 min )
    EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction. (arXiv:2202.05146v2 [q-bio.BM] UPDATED)
    Predicting how a drug-like molecule binds to a specific protein target is a core problem in drug discovery. An extremely fast computational binding method would enable key applications such as fast virtual screening or drug engineering. Existing methods are computationally expensive as they rely on heavy candidate sampling coupled with scoring, ranking, and fine-tuning steps. We challenge this paradigm with EquiBind, an SE(3)-equivariant geometric deep learning model performing direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. Further, we show extra improvements when coupling it with existing fine-tuning techniques at the cost of increased running time. Finally, we propose a novel and fast fine-tuning model that adjusts torsion angles of a ligand's rotatable bonds based on closed-form global minima of the von Mises angular distance to a given input atomic point cloud, avoiding previous expensive differential evolution strategies for energy minimization.  ( 2 min )
    Self-Supervised Metric Learning in Multi-View Data: A Downstream Task Perspective. (arXiv:2106.07138v4 [stat.ML] UPDATED)
    Self-supervised metric learning has been a successful approach for learning a distance from an unlabeled dataset. The resulting distance is broadly useful for improving various distance-based downstream tasks, even when no information from downstream tasks is utilized in the metric learning stage. To gain insights into this approach, we develop a statistical framework to theoretically study how self-supervised metric learning can benefit downstream tasks in the context of multi-view data. Under this framework, we show that the target distance of metric learning satisfies several desired properties for the downstream tasks. On the other hand, our investigation suggests the target distance can be further improved by moderating each direction's weights. In addition, our analysis precisely characterizes the improvement by self-supervised metric learning on four commonly used downstream tasks: sample identification, two-sample testing, $k$-means clustering, and $k$-nearest neighbor classification. When the distance is estimated from an unlabeled dataset, we establish the upper bound on distance estimation's accuracy and the number of samples sufficient for downstream task improvement. Finally, numerical experiments are presented to support the theoretical results in the paper.  ( 2 min )
    Backpropagation through Time and Space: Learning Numerical Methods with Multi-Agent Reinforcement Learning. (arXiv:2203.08937v1 [cs.LG])
    We introduce Backpropagation Through Time and Space (BPTTS), a method for training a recurrent spatio-temporal neural network, that is used in a homogeneous multi-agent reinforcement learning (MARL) setting to learn numerical methods for hyperbolic conservation laws. We treat the numerical schemes underlying partial differential equations (PDEs) as a Partially Observable Markov Game (POMG) in Reinforcement Learning (RL). Similar to numerical solvers, our agent acts at each discrete location of a computational space for efficient and generalizable learning. To learn higher-order spatial methods by acting on local states, the agent must discern how its actions at a given spatiotemporal location affect the future evolution of the state. The manifestation of this non-stationarity is addressed by BPTTS, which allows for the flow of gradients across both space and time. The learned numerical policies are comparable to the SOTA numerics in two settings, the Burgers' Equation and the Euler Equations, and generalize well to other simulation set-ups.  ( 2 min )
    A Framework and Benchmark for Deep Batch Active Learning for Regression. (arXiv:2203.09410v1 [stat.ML])
    We study the performance of different pool-based Batch Mode Deep Active Learning (BMDAL) methods for regression on tabular data, focusing on methods that do not require to modify the network architecture and training. Our contributions are three-fold: First, we present a framework for constructing BMDAL methods out of kernels, kernel transformations and selection methods, showing that many of the most popular BMDAL methods fit into our framework. Second, we propose new components, leading to a new BMDAL method. Third, we introduce an open-source benchmark with 15 large tabular data sets, which we use to compare different BMDAL methods. Our benchmark results show that a combination of our novel components yields new state-of-the-art results in terms of RMSE and is computationally efficient. We provide open-source code that includes efficient implementations of all kernels, kernel transformations, and selection methods, and can be used for reproducing our results.  ( 2 min )
    Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs. (arXiv:2203.09251v1 [cs.LG])
    In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts, which we believe to shed new light on this problem.  ( 2 min )
    Localizing Visual Sounds the Easy Way. (arXiv:2203.09324v1 [cs.CV])
    Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training. Previous works often seek high audio-visual similarities for likely positive (sounding) regions and low similarities for likely negative regions. However, accurately distinguishing between sounding and non-sounding regions is challenging without manual annotations. In this work, we propose a simple yet effective approach for Easy Visual Sound Localization, namely EZ-VSL, without relying on the construction of positive and/or negative regions during training. Instead, we align audio and visual spaces by seeking audio-visual representations that are aligned in, at least, one location of the associated image, while not matching other images, at any location. We also introduce a novel object guided localization scheme at inference time for improved precision. Our simple and effective framework achieves state-of-the-art performance on two popular benchmarks, Flickr SoundNet and VGG-Sound Source. In particular, we improve the CIoU of the Flickr SoundNet test set from 76.80% to 83.94%, and on the VGG-Sound Source dataset from 34.60% to 38.85%. The code is available at https://github.com/stoneMo/EZ-VSL.  ( 2 min )
    Convert, compress, correct: Three steps toward communication-efficient DNN training. (arXiv:2203.09044v1 [cs.LG])
    In this paper, we introduce a novel algorithm, $\mathsf{CO}_3$, for communication-efficiency distributed Deep Neural Network (DNN) training. $\mathsf{CO}_3$ is a joint training/communication protocol, which encompasses three processing steps for the network gradients: (i) quantization through floating-point conversion, (ii) lossless compression, and (iii) error correction. These three components are crucial in the implementation of distributed DNN training over rate-constrained links. The interplay of these three steps in processing the DNN gradients is carefully balanced to yield a robust and high-performance scheme. The performance of the proposed scheme is investigated through numerical evaluations over CIFAR-10.  ( 2 min )
    Learning Long-Term Reward Redistribution via Randomized Return Decomposition. (arXiv:2111.13485v2 [cs.LG] UPDATED)
    Many practical applications of reinforcement learning require agents to learn from sparse and delayed rewards. It challenges the ability of agents to attribute their actions to future outcomes. In this paper, we consider the problem formulation of episodic reinforcement learning with trajectory feedback. It refers to an extreme delay of reward signals, in which the agent can only obtain one reward signal at the end of each trajectory. A popular paradigm for this problem setting is learning with a designed auxiliary dense reward function, namely proxy reward, instead of sparse environmental signals. Based on this framework, this paper proposes a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning. We establish a surrogate problem by Monte-Carlo sampling that scales up least-squares-based reward redistribution to long-horizon problems. We analyze our surrogate loss function by connection with existing methods in the literature, which illustrates the algorithmic properties of our approach. In experiments, we extensively evaluate our proposed method on a variety of benchmark tasks with episodic rewards and demonstrate substantial improvement over baseline algorithms.  ( 2 min )
    PreTR: Spatio-Temporal Non-Autoregressive Trajectory Prediction Transformer. (arXiv:2203.09293v1 [cs.CV])
    Nowadays, our mobility systems are evolving into the era of intelligent vehicles that aim to improve road safety. Due to their vulnerability, pedestrians are the users who will benefit the most from these developments. However, predicting their trajectory is one of the most challenging concerns. Indeed, accurate prediction requires a good understanding of multi-agent interactions that can be complex. Learning the underlying spatial and temporal patterns caused by these interactions is even more of a competitive and open problem that many researchers are tackling. In this paper, we introduce a model called PRediction Transformer (PReTR) that extracts features from the multi-agent scenes by employing a factorized spatio-temporal attention module. It shows less computational needs than previously studied models with empirically better results. Besides, previous works in motion prediction suffer from the exposure bias problem caused by generating future sequences conditioned on model prediction samples rather than ground-truth samples. In order to go beyond the proposed solutions, we leverage encoder-decoder Transformer networks for parallel decoding a set of learned object queries. This non-autoregressive solution avoids the need for iterative conditioning and arguably decreases training and testing computational time. We evaluate our model on the ETH/UCY datasets, a publicly available benchmark for pedestrian trajectory prediction. Finally, we justify our usage of the parallel decoding technique by showing that the trajectory prediction task can be better solved as a non-autoregressive task.  ( 2 min )
    GLM: General Language Model Pretraining with Autoregressive Blank Infilling. (arXiv:2103.10360v2 [cs.CL] UPDATED)
    There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). However, none of the pretraining frameworks performs the best for all tasks of three main categories including natural language understanding (NLU), unconditional generation, and conditional generation. We propose a General Language Model (GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans, which results in performance gains over BERT and T5 on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional and unconditional generation, GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25x parameters of BERT Large , demonstrating its generalizability to different downstream tasks.  ( 2 min )
    Learning the Dynamics of Physical Systems from Sparse Observations with Finite Element Networks. (arXiv:2203.08852v1 [cs.LG])
    We propose a new method for spatio-temporal forecasting on arbitrarily distributed points. Assuming that the observed system follows an unknown partial differential equation, we derive a continuous-time model for the dynamics of the data via the finite element method. The resulting graph neural network estimates the instantaneous effects of the unknown dynamics on each cell in a meshing of the spatial domain. Our model can incorporate prior knowledge via assumptions on the form of the unknown PDE, which induce a structural bias towards learning specific processes. Through this mechanism, we derive a transport variant of our model from the convection equation and show that it improves the transfer performance to higher-resolution meshes on sea surface temperature and gas flow forecasting against baseline models representing a selection of spatio-temporal forecasting methods. A qualitative analysis shows that our model disentangles the data dynamics into their constituent parts, which makes it uniquely interpretable.  ( 2 min )
    Nearest Neighbor Classifier with Margin Penalty for Active Learning. (arXiv:2203.09174v1 [cs.IR])
    As deep learning becomes the mainstream in the field of natural language processing, the need for suitable active learning method are becoming unprecedented urgent. Active Learning (AL) methods based on nearest neighbor classifier are proposed and demonstrated superior results. However, existing nearest neighbor classifier are not suitable for classifying mutual exclusive classes because inter-class discrepancy cannot be assured by nearest neighbor classifiers. As a result, informative samples in the margin area can not be discovered and AL performance are damaged. To this end, we propose a novel Nearest neighbor Classifier with Margin penalty for Active Learning(NCMAL). Firstly, mandatory margin penalty are added between classes, therefore both inter-class discrepancy and intra-class compactness are both assured. Secondly, a novel sample selection strategy are proposed to discover informative samples within the margin area. To demonstrate the effectiveness of the methods, we conduct extensive experiments on for datasets with other state-of-the-art methods. The experimental results demonstrate that our method achieves better results with fewer annotated samples than all baseline methods.  ( 2 min )
    On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks. (arXiv:2203.09168v1 [cs.LG])
    Capturing aleatoric uncertainty is a critical part of many machine learning systems. In deep learning, a common approach to this end is to train a neural network to estimate the parameters of a heteroscedastic Gaussian distribution by maximizing the logarithm of the likelihood function under the observed data. In this work, we examine this approach and identify potential hazards associated with the use of log-likelihood in conjunction with gradient-based optimizers. First, we present a synthetic example illustrating how this approach can lead to very poor but stable parameter estimates. Second, we identify the culprit to be the log-likelihood loss, along with certain conditions that exacerbate the issue. Third, we present an alternative formulation, termed $\beta$-NLL, in which each data point's contribution to the loss is weighted by the $\beta$-exponentiated variance estimate. We show that using an appropriate $\beta$ largely mitigates the issue in our illustrative example. Fourth, we evaluate this approach on a range of domains and tasks and show that it achieves considerable improvements and performs more robustly concerning hyperparameters, both in predictive RMSE and log-likelihood criteria.  ( 2 min )
    Near Instance Optimal Model Selection for Pure Exploration Linear Bandits. (arXiv:2109.05131v2 [stat.ML] UPDATED)
    We introduce the model selection problem in pure exploration linear bandits, where the learner needs to adapt to the instance-dependent complexity measure of the smallest hypothesis class containing the true model. We design algorithms in both fixed confidence and fixed budget settings with near instance optimal guarantees. The core of our algorithms is a new optimization problem based on experimental design that leverages the geometry of the action set to identify a near-optimal hypothesis class. Our fixed budget algorithm is developed based on a novel selection-validation procedure, which provides a new way to study the understudied fixed budget setting (even without the added challenge of model selection). We adapt our algorithms, in both fixed confidence and fixed budget settings, to problems with model misspecification.  ( 2 min )
    Fine-tuning Global Model via Data-Free Knowledge Distillation for Non-IID Federated Learning. (arXiv:2203.09249v1 [cs.LG])
    Federated Learning (FL) is an emerging distributed learning paradigm under privacy constraint. Data heterogeneity is one of the main challenges in FL, which results in slow convergence and degraded performance. Most existing approaches only tackle the heterogeneity challenge by restricting the local model update in client, ignoring the performance drop caused by direct global model aggregation. Instead, we propose a data-free knowledge distillation method to fine-tune the global model in the server (FedFTG), which relieves the issue of direct model aggregation. Concretely, FedFTG explores the input space of local models through a generator, and uses it to transfer the knowledge from local models to the global model. Besides, we propose a hard sample mining scheme to achieve effective knowledge distillation throughout the training. In addition, we develop customized label sampling and class-level ensemble to derive maximum utilization of knowledge, which implicitly mitigates the distribution discrepancy across clients. Extensive experiments show that our FedFTG significantly outperforms the state-of-the-art (SOTA) FL algorithms and can serve as a strong plugin for enhancing FedAvg, FedProx, FedDyn, and SCAFFOLD.  ( 2 min )
    Time and the Value of Data. (arXiv:2203.09118v1 [cs.LG])
    Managers often believe that collecting more data will continually improve the accuracy of their machine learning models. However, we argue in this paper that when data lose relevance over time, it may be optimal to collect a limited amount of recent data instead of keeping around an infinite supply of older (less relevant) data. In addition, we argue that increasing the stock of data by including older datasets may, in fact, damage the model's accuracy. Expectedly, the model's accuracy improves by increasing the flow of data (defined as data collection rate); however, it requires other tradeoffs in terms of refreshing or retraining machine learning models more frequently. Using these results, we investigate how the business value created by machine learning models scales with data and when the stock of data establishes a sustainable competitive advantage. We argue that data's time-dependency weakens the barrier to entry that the stock of data creates. As a result, a competing firm equipped with a limited (yet sufficient) amount of recent data can develop more accurate models. This result, coupled with the fact that older datasets may deteriorate models' accuracy, suggests that created business value doesn't scale with the stock of available data unless the firm offloads less relevant data from its data repository. Consequently, a firm's growth policy should incorporate a balance between the stock of historical data and the flow of new data. We complement our theoretical results with an experiment. In the experiment, we empirically measure the loss in the accuracy of a next word prediction model trained on datasets from various time periods. Our empirical measurements confirm the economic significance of the value decline over time. For example, 100MB of text data, after seven years, becomes as valuable as 50MB of current data for the next word prediction task.  ( 2 min )
    Source-Free Adaptation to Measurement Shift via Bottom-Up Feature Restoration. (arXiv:2107.05446v3 [cs.LG] UPDATED)
    Source-free domain adaptation (SFDA) aims to adapt a model trained on labelled data in a source domain to unlabelled data in a target domain without access to the source-domain data during adaptation. Existing methods for SFDA leverage entropy-minimization techniques which: (i) apply only to classification; (ii) destroy model calibration; and (iii) rely on the source model achieving a good level of feature-space class-separation in the target domain. We address these issues for a particularly pervasive type of domain shift called measurement shift which can be resolved by restoring the source features rather than extracting new ones. In particular, we propose Feature Restoration (FR) wherein we: (i) store a lightweight and flexible approximation of the feature distribution under the source data; and (ii) adapt the feature-extractor such that the approximate feature distribution under the target data realigns with that saved on the source. We additionally propose a bottom-up training scheme which boosts performance, which we call Bottom-Up Feature Restoration (BUFR). On real and synthetic data, we demonstrate that BUFR outperforms existing SFDA methods in terms of accuracy, calibration, and data efficiency, while being less reliant on the performance of the source model in the target domain.  ( 2 min )
    Multimodal Learning on Graphs for Disease Relation Extraction. (arXiv:2203.08893v1 [cs.LG])
    Objective: Disease knowledge graphs are a way to connect, organize, and access disparate information about diseases with numerous benefits for artificial intelligence (AI). To create knowledge graphs, it is necessary to extract knowledge from multimodal datasets in the form of relationships between disease concepts and normalize both concepts and relationship types. Methods: We introduce REMAP, a multimodal approach for disease relation extraction and classification. The REMAP machine learning approach jointly embeds a partial, incomplete knowledge graph and a medical language dataset into a compact latent vector space, followed by aligning the multimodal embeddings for optimal disease relation extraction. Results: We apply REMAP approach to a disease knowledge graph with 96,913 relations and a text dataset of 1.24 million sentences. On a dataset annotated by human experts, REMAP improves text-based disease relation extraction by 10.0% (accuracy) and 17.2% (F1-score) by fusing disease knowledge graphs with text information. Further, REMAP leverages text information to recommend new relationships in the knowledge graph, outperforming graph-based methods by 8.4% (accuracy) and 10.4% (F1-score). Discussion: Systematized knowledge is becoming the backbone of AI, creating opportunities to inject semantics into AI and fully integrate it into machine learning algorithms. While prior semantic knowledge can assist in extracting disease relationships from text, existing methods can not fully leverage multimodal datasets. Conclusion: REMAP is a multimodal approach for extracting and classifying disease relationships by fusing structured knowledge and text information. REMAP provides a flexible neural architecture to easily find, access, and validate AI-driven relationships between disease concepts.  ( 2 min )
    Optimal Rejection Function Meets Character Recognition Tasks. (arXiv:2203.09151v1 [cs.CV])
    In this paper, we propose an optimal rejection method for rejecting ambiguous samples by a rejection function. This rejection function is trained together with a classification function under the framework of Learning-with-Rejection (LwR). The highlights of LwR are: (1) the rejection strategy is not heuristic but has a strong background from a machine learning theory, and (2) the rejection function can be trained on an arbitrary feature space which is different from the feature space for classification. The latter suggests we can choose a feature space that is more suitable for rejection. Although the past research on LwR focused only on its theoretical aspect, we propose to utilize LwR for practical pattern classification tasks. Moreover, we propose to use features from different CNN layers for classification and rejection. Our extensive experiments of notMNIST classification and character/non-character classification demonstrate that the proposed method achieves better performance than traditional rejection strategies.  ( 2 min )
    Gaussian initializations help deep variational quantum circuits escape from the barren plateau. (arXiv:2203.09376v1 [quant-ph])
    Variational quantum circuits have been widely employed in quantum simulation and quantum machine learning in recent years. However, quantum circuits with random structures have poor trainability due to the exponentially vanishing gradient with respect to the circuit depth and the qubit number. This result leads to a general belief that deep quantum circuits will not be feasible for practical tasks. In this work, we propose an initialization strategy with theoretical guarantees for the vanishing gradient problem in general deep circuits. Specifically, we prove that under proper Gaussian initialized parameters, the norm of the gradient decays at most polynomially when the qubit number and the circuit depth increase. Our theoretical results hold for both the local and the global observable cases, where the latter was believed to have vanishing gradients even for shallow circuits. Experimental results verify our theoretical findings in the quantum simulation and quantum chemistry.  ( 2 min )
    When Chosen Wisely, More Data Is What You Need: A Universal Sample-Efficient Strategy For Data Augmentation. (arXiv:2203.09391v1 [cs.LG])
    Data Augmentation (DA) is known to improve the generalizability of deep neural networks. Most existing DA techniques naively add a certain number of augmented samples without considering the quality and the added computational cost of these samples. To tackle this problem, a common strategy, adopted by several state-of-the-art DA methods, is to adaptively generate or re-weight augmented samples with respect to the task objective during training. However, these adaptive DA methods: (1) are computationally expensive and not sample-efficient, and (2) are designed merely for a specific setting. In this work, we present a universal DA technique, called Glitter, to overcome both issues. Glitter can be plugged into any DA method, making training sample-efficient without sacrificing performance. From a pre-generated pool of augmented samples, Glitter adaptively selects a subset of worst-case samples with maximal loss, analogous to adversarial DA. Without altering the training strategy, the task objective can be optimized on the selected subset. Our thorough experiments on the GLUE benchmark, SQuAD, and HellaSwag in three widely used training setups including consistency training, self-distillation and knowledge distillation reveal that Glitter is substantially faster to train and achieves a competitive performance, compared to strong baselines.  ( 2 min )
    SoK: Differential Privacy on Graph-Structured Data. (arXiv:2203.09205v1 [cs.CR])
    In this work, we study the applications of differential privacy (DP) in the context of graph-structured data. We discuss the formulations of DP applicable to the publication of graphs and their associated statistics as well as machine learning on graph-based data, including graph neural networks (GNNs). The formulation of DP in the context of graph-structured data is difficult, as individual data points are interconnected (often non-linearly or sparsely). This connectivity complicates the computation of individual privacy loss in differentially private learning. The problem is exacerbated by an absence of a single, well-established formulation of DP in graph settings. This issue extends to the domain of GNNs, rendering private machine learning on graph-structured data a challenging task. A lack of prior systematisation work motivated us to study graph-based learning from a privacy perspective. In this work, we systematise different formulations of DP on graphs, discuss challenges and promising applications, including the GNN domain. We compare and separate works into graph analysis tasks and graph learning tasks with GNNs. Finally, we conclude our work with a discussion of open questions and potential directions for further research in this area.  ( 2 min )
    TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding. (arXiv:2203.09098v1 [cs.SD])
    Speaker embedding is an important front-end module to explore discriminative speaker features for many speech applications where speaker information is needed. Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation. However, naively adding many branches of multi-scale features with the simple fully convolutional operation could not efficiently improve the performance due to the rapid increase of model parameters and computational complexity. Therefore, in the most current state-of-the-art network architectures, only a few branches corresponding to a limited number of temporal scales could be designed for speaker embeddings. To address this problem, in this paper, we propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs. The new model is based on the conventional TDNN, where the network architecture is smartly separated into two modeling operators: a channel-modeling operator and a temporal multi-branch modeling operator. Adding temporal multi-scale in the temporal multi-branch operator needs only a little bit increase of the number of parameters, and thus save more computational budget for adding more branches with large temporal scales. Moreover, in the inference stage, we further developed a systemic re-parameterization method to convert the TMS-based model into a single-path-based topology in order to increase inference speed. We investigated the performance of the new TMS method for automatic speaker verification (ASV) on in-domain and out-of-domain conditions. Results show that the TMS-based model obtained a significant increase in the performance over the SOTA ASV models, meanwhile, had a faster inference speed.  ( 2 min )
    Few-Shot Learning on Graphs: A Survey. (arXiv:2203.09308v1 [cs.LG])
    Graph representation learning has attracted tremendous attention due to its remarkable performance in many real-world applications. However, prevailing (semi-)supervised graph representation learning models for specific tasks often suffer from label sparsity issue as data labeling is always time and resource consuming. In light of this, few-shot learning on graphs (FSLG), which combines the strengths of graph representation learning and few-shot learning together, has been proposed to tackle the performance degradation in face of limited annotated data challenge. There have been many studies working on FSLG recently. In this paper, we comprehensively survey these work in the form of a series of methods and applications. Specifically, we first introduce FSLG challenges and bases, then categorize and summarize existing work of FSLG in terms of three major graph mining tasks at different granularity levels, i.e., node, edge, and graph. Finally, we share our thoughts on some future research directions of FSLG. The authors of this survey have contributed significantly to the AI literature on FSLG over the last few years.  ( 2 min )
    Graph Augmentation Learning. (arXiv:2203.09020v1 [cs.LG])
    Graph Augmentation Learning (GAL) provides outstanding solutions for graph learning in handling incomplete data, noise data, etc. Numerous GAL methods have been proposed for graph-based applications such as social network analysis and traffic flow forecasting. However, the underlying reasons for the effectiveness of these GAL methods are still unclear. As a consequence, how to choose optimal graph augmentation strategy for a certain application scenario is still in black box. There is a lack of systematic, comprehensive, and experimentally validated guideline of GAL for scholars. Therefore, in this survey, we in-depth review GAL techniques from macro (graph), meso (subgraph), and micro (node/edge) levels. We further detailedly illustrate how GAL enhance the data quality and the model performance. The aggregation mechanism of augmentation strategies and graph learning models are also discussed by different application scenarios, i.e., data-specific, model-specific, and hybrid scenarios. To better show the outperformance of GAL, we experimentally validate the effectiveness and adaptability of different GAL strategies in different downstream tasks. Finally, we share our insights on several open issues of GAL, including heterogeneity, spatio-temporal dynamics, scalability, and generalization.  ( 2 min )
    3D-UCaps: 3D Capsules Unet for Volumetric Image Segmentation. (arXiv:2203.08965v1 [cs.CV])
    Medical image segmentation has been so far achieving promising results with Convolutional Neural Networks (CNNs). However, it is arguable that in traditional CNNs, its pooling layer tends to discard important information such as positions. Moreover, CNNs are sensitive to rotation and affine transformation. Capsule network is a data-efficient network design proposed to overcome such limitations by replacing pooling layers with dynamic routing and convolutional strides, which aims to preserve the part-whole relationships. Capsule network has shown a great performance in image recognition and natural language processing, but applications for medical image segmentation, particularly volumetric image segmentation, has been limited. In this work, we propose 3D-UCaps, a 3D voxel-based Capsule network for medical volumetric image segmentation. We build the concept of capsules into a CNN by designing a network with two pathways: the first pathway is encoded by 3D Capsule blocks, whereas the second pathway is decoded by 3D CNNs blocks. 3D-UCaps, therefore inherits the merits from both Capsule network to preserve the spatial relationship and CNNs to learn visual representation. We conducted experiments on various datasets to demonstrate the robustness of 3D-UCaps including iSeg-2017, LUNA16, Hippocampus, and Cardiac, where our method outperforms previous Capsule networks and 3D-Unets.  ( 2 min )
    FILM: Following Instructions in Language with Modular Methods. (arXiv:2110.07342v3 [cs.CL] UPDATED)
    Recent methods for embodied instruction following are typically trained end-to-end using imitation learning. This often requires the use of expert trajectories and low-level language instructions. Such approaches assume that neural states will integrate multimodal semantics to perform state tracking, building spatial memory, exploration, and long-term planning. In contrast, we propose a modular method with structured representations that (1) builds a semantic map of the scene and (2) performs exploration with a semantic search policy, to achieve the natural language goal. Our modular method achieves SOTA performance (24.46 %) with a substantial (8.17 % absolute) gap from previous work while using less data by eschewing both expert trajectories and low-level instructions. Leveraging low-level language, however, can further increase our performance (26.49 %). Our findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance, even in the absence of expert trajectories or low-level instructions.  ( 2 min )
    PiDAn: A Coherence Optimization Approach for Backdoor Attack Detection and Mitigation in Deep Neural Networks. (arXiv:2203.09289v1 [cs.LG])
    Backdoor attacks impose a new threat in Deep Neural Networks (DNNs), where a backdoor is inserted into the neural network by poisoning the training dataset, misclassifying inputs that contain the adversary trigger. The major challenge for defending against these attacks is that only the attacker knows the secret trigger and the target class. The problem is further exacerbated by the recent introduction of "Hidden Triggers", where the triggers are carefully fused into the input, bypassing detection by human inspection and causing backdoor identification through anomaly detection to fail. To defend against such imperceptible attacks, in this work we systematically analyze how representations, i.e., the set of neuron activations for a given DNN when using the training data as inputs, are affected by backdoor attacks. We propose PiDAn, an algorithm based on coherence optimization purifying the poisoned data. Our analysis shows that representations of poisoned data and authentic data in the target class are still embedded in different linear subspaces, which implies that they show different coherence with some latent spaces. Based on this observation, the proposed PiDAn algorithm learns a sample-wise weight vector to maximize the projected coherence of weighted samples, where we demonstrate that the learned weight vector has a natural "grouping effect" and is distinguishable between authentic data and poisoned data. This enables the systematic detection and mitigation of backdoor attacks. Based on our theoretical analysis and experimental results, we demonstrate the effectiveness of PiDAn in defending against backdoor attacks that use different settings of poisoned samples on GTSRB and ILSVRC2012 datasets. Our PiDAn algorithm can detect more than 90% infected classes and identify 95% poisoned samples.  ( 2 min )
    Dynamic fracture of a bicontinuously nanostructured copolymer: A deep-learning analysis of big-data-generating experiment. (arXiv:2112.01971v2 [cond-mat.mtrl-sci] UPDATED)
    Here, we report measurements of detailed dynamic cohesive properties (DCPs) beyond the dynamic fracture toughness of a bicontinuously nanostructured copolymer, polyurea, under an extremely loading rate, from deep-learning analyses of a dynamic big-data-generating experiment. We first describe a new Dynamic Line-Image Shearing Interferometer (DL-ISI), which uses a streak camera to record optical fringes of displacement-gradient vs time profile along a line on sample's rear surface. This system enables us to detect crack initiation and growth processes in plate-impact experiments. Then, we present a convolutional neural network (CNN) based deep-learning framework, trained by extensive finite-element simulations, that inversely determines the accurate DCPs from the DL-ISI fringe images. For the measurements, plate-impact experiments were performed on a set of samples with a mid-plane crack. A Conditional Generative Adversarial Networks (cGAN) was employed first to reconstruct missing DL-ISI fringes with recorded partial DL-ISI fringes. Then, the CNN and a correlation method were applied to the fully reconstructed fringes to get the dynamic fracture toughness, 12.1kJ/m^2, cohesive strength, 302 MPa, and maximum cohesive separation, 80.5 um, within 0.4%, 2.7%, and 2.2% differences, respectively. For the first time, the DCPs of polyurea have been successfully obtained by the DL-ISI with the pre-trained CNN and correlation analyses of cGAN-reconstructed data sets. The dynamic cohesive strength is found to be nearly three times higher than the dynamic-failure-initiation strength. The high dynamic fracture toughness is found to stem from both high dynamic cohesive strength and high ductility of the dynamic cohesive separation.  ( 2 min )
    Orchestrated Value Mapping for Reinforcement Learning. (arXiv:2203.07171v2 [cs.LG] UPDATED)
    We present a general convergent class of reinforcement learning algorithms that is founded on two distinct principles: (1) mapping value estimates to a different space using arbitrary functions from a broad class, and (2) linearly decomposing the reward signal into multiple channels. The first principle enables incorporating specific properties into the value estimator that can enhance learning. The second principle, on the other hand, allows for the value function to be represented as a composition of multiple utility functions. This can be leveraged for various purposes, e.g. dealing with highly varying reward scales, incorporating a priori knowledge about the sources of reward, and ensemble learning. Combining the two principles yields a general blueprint for instantiating convergent algorithms by orchestrating diverse mapping functions over multiple reward channels. This blueprint generalizes and subsumes algorithms such as Q-Learning, Log Q-Learning, and Q-Decomposition. In addition, our convergence proof for this general class relaxes certain required assumptions in some of these algorithms. Based on our theory, we discuss several interesting configurations as special cases. Finally, to illustrate the potential of the design space that our theory opens up, we instantiate a particular algorithm and evaluate its performance on the Atari suite.  ( 2 min )
    On the Properties of Adversarially-Trained CNNs. (arXiv:2203.09243v1 [cs.CV])
    Adversarial Training has proved to be an effective training paradigm to enforce robustness against adversarial examples in modern neural network architectures. Despite many efforts, explanations of the foundational principles underpinning the effectiveness of Adversarial Training are limited and far from being widely accepted by the Deep Learning community. In this paper, we describe surprising properties of adversarially-trained models, shedding light on mechanisms through which robustness against adversarial attacks is implemented. Moreover, we highlight limitations and failure modes affecting these models that were not discussed by prior works. We conduct extensive analyses on a wide range of architectures and datasets, performing a deep comparison between robust and natural models.  ( 2 min )
    Investigation of Physics-Informed Deep Learning for the Prediction of Parametric, Three-Dimensional Flow Based on Boundary Data. (arXiv:2203.09204v1 [cs.LG])
    The placement of temperature sensitive and safety-critical components is crucial in the automotive industry. It is therefore inevitable, even at the design stage of new vehicles that these components are assessed for potential safety issues. However, with increasing number of design proposals, risk assessment quickly becomes expensive. We therefore present a parameterized surrogate model for the prediction of three-dimensional flow fields in aerothermal vehicle simulations. The proposed physics-informed neural network (PINN) design is aimed at learning families of flow solutions according to a geometric variation. In scope of this work, we could show that our nondimensional, multivariate scheme can be efficiently trained to predict the velocity and pressure distribution for different design scenarios and geometric scales. The proposed algorithm is based on a parametric minibatch training which enables the utilization of large datasets necessary for the three-dimensional flow modeling. Further, we introduce a continuous resampling algorithm that allows to operate on one static dataset. Every feature of our methodology is tested individually and verified against conventional CFD simulations. Finally, we apply our proposed method in context of an exemplary real-world automotive application.  ( 2 min )
    Sensitivity Estimation for Dark Matter Subhalos in Synthetic Gaia DR2 using Deep Learning. (arXiv:2203.08161v1 [astro-ph.GA] CROSS LISTED)
    The abundance of dark matter subhalos orbiting a host galaxy is a generic prediction of the cosmological framework. It is a promising way to constrain the nature of dark matter. Here we describe the challenges of detecting stars whose phase-space distribution may be perturbed by the passage of dark matter subhalos using a machine learning approach. The training data are three Milky Way-like galaxies and nine synthetic Gaia DR2 surveys derived from these. We first quantify the magnitude of the perturbations in the simulated galaxies using an anomaly detection algorithm. We also estimate the feasibility of this approach in the Gaia DR2-like catalogues by comparing the anomaly detection based approach with a supervised classification. We find that a classification algorithm optimized on about half a billion synthetic star observables exhibits mild but nonzero sensitivity. This classification-based approach is not sufficiently sensitive to pinpoint the exact locations of subhalos in the simulation, as would be expected from the very limited number of subhalos in the detectable region. The enormous size of the Gaia dataset motivates the further development of scalable and accurate computational methods that could be used to select potential regions of interest for dark matter searches to ultimately constrain the Milky Way's subhalo mass function.  ( 2 min )
    Multitask Prompted Training Enables Zero-Shot Task Generalization. (arXiv:2110.08207v3 [cs.LG] UPDATED)
    Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero and all prompts are available at https://github.com/bigscience-workshop/promptsource.  ( 3 min )
    Unsupervised Instance Selection with Low-Label, Supervised Learning for Outlier Detection. (arXiv:2104.12837v2 [cs.LG] UPDATED)
    The laborious process of labeling data often bottlenecks projects that aim to leverage the power of supervised machine learning. Active Learning (AL) has been established as a technique to ameliorate this condition through an iterative framework that queries a human annotator for labels of instances with the most uncertain class assignment. Via this mechanism, AL produces a binary classifier trained on less labeled data but with little, if any, loss in predictive performance. Despite its advantages, AL can have difficulty with class-imbalanced datasets and results in an inefficient labeling process. To address these drawbacks, we investigate our unsupervised instance selection (UNISEL) technique followed by a Random Forest (RF) classifier on 10 outlier detection datasets under low-label conditions. These results are compared to AL performed on the same datasets. Further, we investigate the combination of UNISEL and AL. Results indicate that UNISEL followed by an RF performs comparably to AL with an RF and that the combination of UNISEL and AL demonstrates superior performance. The practical implications of these findings in terms of time savings and generalizability afforded by UNISEL are discussed.  ( 2 min )
    Semi-Markov Offline Reinforcement Learning for Healthcare. (arXiv:2203.09365v1 [cs.LG])
    Reinforcement learning (RL) tasks are typically framed as Markov Decision Processes (MDPs), assuming that decisions are made at fixed time intervals. However, many applications of great importance, including healthcare, do not satisfy this assumption, yet they are commonly modelled as MDPs after an artificial reshaping of the data. In addition, most healthcare (and similar) problems are offline by nature, allowing for only retrospective studies. To address both challenges, we begin by discussing the Semi-MDP (SMDP) framework, which formally handles actions of variable timings. We next present a formal way to apply SMDP modifications to nearly any given value-based offline RL method. We use this theory to introduce three SMDP-based offline RL algorithms, namely, SDQN, SDDQN, and SBCQ. We then experimentally demonstrate that these SMDP-based algorithms learn the optimal policy in these variable-time environments, whereas un-directed modifications of MDP modelling lead to sub-optimal policies. Finally, we apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention and demonstrate similar results.  ( 2 min )
    Progressive Subsampling for Oversampled Data -- Application to Quantitative MRI. (arXiv:2203.09268v1 [eess.IV])
    We present PROSUB: PROgressive SUBsampling, a deep learning based, automated methodology that subsamples an oversampled data set (e.g. multi-channeled 3D images) with minimal loss of information. We build upon a recent dual-network approach that won the MICCAI MUlti-DIffusion (MUDI) quantitative MRI measurement sampling-reconstruction challenge, but suffers from deep learning training instability, by subsampling with a hard decision boundary. PROSUB uses the paradigm of recursive feature elimination (RFE) and progressively subsamples measurements during deep learning training, improving optimization stability. PROSUB also integrates a neural architecture search (NAS) paradigm, allowing the network architecture hyperparameters to respond to the subsampling process. We show PROSUB outperforms the winner of the MUDI MICCAI challenge, producing large improvements >18% MSE on the MUDI challenge sub-tasks and qualitative improvements on downstream processes useful for clinical applications. We also show the benefits of incorporating NAS and analyze the effect of PROSUB's components. As our method generalizes to other problems beyond MRI measurement selection-reconstruction, our code is https://github.com/sbb-gh/PROSUB  ( 2 min )
    Learning from humans: combining imitation and deep reinforcement learning to accomplish human-level performance on a virtual foraging task. (arXiv:2203.06250v2 [cs.LG] UPDATED)
    We develop a method to learn bio-inspired foraging policies using human data. We conduct an experiment where humans are virtually immersed in an open field foraging environment and are trained to collect the highest amount of rewards. A Markov Decision Process (MDP) framework is introduced to model the human decision dynamics. Then, Imitation Learning (IL) based on maximum likelihood estimation is used to train Neural Networks (NN) that map human decisions to observed states. The results show that passive imitation substantially underperforms humans. We further refine the human-inspired policies via Reinforcement Learning (RL), using on-policy algorithms that are more suitable to learn from pre-trained networks. We show that the combination of IL and RL can match human results and that good performance strongly depends on an egocentric representation of the environment. The developed methodology can be used to efficiently learn policies for unmanned vehicles which have to solve missions in an open field environment.  ( 2 min )
    A Heterogeneous Graph Based Framework for Multimodal Neuroimaging Fusion Learning. (arXiv:2110.08465v3 [cs.LG] UPDATED)
    Here, we present a Heterogeneous Graph neural network for Multimodal neuroimaging fusion learning (HGM). Traditional GNN-based models usually assume the brain network is a homogeneous graph with single type of nodes and edges. However, vast literatures have shown the heterogeneity of the human brain especially between the two hemispheres. Homogeneous brain network is insufficient to model the complicated brain state. Therefore, in this work we firstly model the brain network as heterogeneous graph with multi-type nodes (i.e., left and right hemispheric nodes) and multi-type edges (i.e., intra- and inter-hemispheric edges). Besides, we also propose a self-supervised pre-training strategy based on heterogeneou brain network to address the overfitting problem due to the complex model and small sample size. Our results on two datasets show the superiority of proposed model over other multimodal methods for disease prediction task. Besides, ablation experiments show that our model with pre-training strategy can alleviate the problem of limited training sample size.  ( 2 min )
    Graph Representation Learning with Individualization and Refinement. (arXiv:2203.09141v1 [cs.LG])
    Graph Neural Networks (GNNs) have emerged as prominent models for representation learning on graph structured data. GNNs follow an approach of message passing analogous to 1-dimensional Weisfeiler Lehman (1-WL) test for graph isomorphism and consequently are limited by the distinguishing power of 1-WL. More expressive higher-order GNNs which operate on k-tuples of nodes need increased computational resources in order to process higher-order tensors. Instead of the WL approach, in this work, we follow the classical approach of Individualization and Refinement (IR), a technique followed by most practical isomorphism solvers. Individualization refers to artificially distinguishing a node in the graph and refinement is the propagation of this information to other nodes through message passing. We learn to adaptively select nodes to individualize and to aggregate the resulting graphs after refinement to help handle the complexity. Our technique lets us learn richer node embeddings while keeping the computational complexity manageable. Theoretically, we show that our procedure is more expressive than the 1-WL test. Experiments show that our method outperforms prominent 1-WL GNN models as well as competitive higher-order baselines on several benchmark synthetic and real datasets. Furthermore, our method opens new doors for exploring the paradigm of learning on graph structures with individualization and refinement.  ( 2 min )
    On the Convergence of Certified Robust Training with Interval Bound Propagation. (arXiv:2203.08961v1 [cs.LG])
    Interval Bound Propagation (IBP) is so far the base of state-of-the-art methods for training neural networks with certifiable robustness guarantees when potential adversarial perturbations present, while the convergence of IBP training remains unknown in existing literature. In this paper, we present a theoretical analysis on the convergence of IBP training. With an overparameterized assumption, we analyze the convergence of IBP robust training. We show that when using IBP training to train a randomly initialized two-layer ReLU neural network with logistic loss, gradient descent can linearly converge to zero robust training error with a high probability if we have sufficiently small perturbation radius and large network width.
    CKConv: Continuous Kernel Convolution For Sequential Data. (arXiv:2102.02611v3 [cs.LG] UPDATED)
    Conventional neural architectures for sequential data present important limitations. Recurrent networks suffer from exploding and vanishing gradients, small effective memory horizons, and must be trained sequentially. Convolutional networks are unable to handle sequences of unknown size and their memory horizon must be defined a priori. In this work, we show that all these problems can be solved by formulating convolutional kernels in CNNs as continuous functions. The resulting Continuous Kernel Convolution (CKConv) allows us to model arbitrarily long sequences in a parallel manner, within a single operation, and without relying on any form of recurrence. We show that Continuous Kernel Convolutional Networks (CKCNNs) obtain state-of-the-art results in multiple datasets, e.g., permuted MNIST, and, thanks to their continuous nature, are able to handle non-uniformly sampled datasets and irregularly-sampled data natively. CKCNNs match or perform better than neural ODEs designed for these purposes in a faster and simpler manner.  ( 2 min )
    An Interactive Explanatory AI System for Industrial Quality Control. (arXiv:2203.09181v1 [cs.LG])
    Machine learning based image classification algorithms, such as deep neural network approaches, will be increasingly employed in critical settings such as quality control in industry, where transparency and comprehensibility of decisions are crucial. Therefore, we aim to extend the defect detection task towards an interactive human-in-the-loop approach that allows us to integrate rich background knowledge and the inference of complex relationships going beyond traditional purely data-driven approaches. We propose an approach for an interactive support system for classifications in an industrial quality control setting that combines the advantages of both (explainable) knowledge-driven and data-driven machine learning methods, in particular inductive logic programming and convolutional neural networks, with human expertise and control. The resulting system can assist domain experts with decisions, provide transparent explanations for results, and integrate feedback from users; thus reducing workload for humans while both respecting their expertise and without removing their agency or accountability.  ( 2 min )
    Ranking of Communities in Multiplex Spatiotemporal Models of Brain Dynamics. (arXiv:2203.09281v1 [q-bio.NC])
    As a relatively new field, network neuroscience has tended to focus on aggregate behaviours of the brain averaged over many successive experiments or over long recordings in order to construct robust brain models. These models are limited in their ability to explain dynamic state changes in the brain which occurs spontaneously as a result of normal brain function. Hidden Markov Models (HMMs) trained on neuroimaging time series data have since arisen as a method to produce dynamical models that are easy to train but can be difficult to fully parametrise or analyse. We propose an interpretation of these neural HMMs as multiplex brain state graph models we term Hidden Markov Graph Models (HMGMs). This interpretation allows for dynamic brain activity to be analysed using the full repertoire of network analysis techniques. Furthermore, we propose a general method for selecting HMM hyperparameters in the absence of external data, based on the principle of maximum entropy, and use this to select the number of layers in the multiplex model. We produce a new tool for determining important communities of brain regions using a spatiotemporal random walk-based procedure that takes advantage of the underlying Markov structure of the model. Our analysis of real multi-subject fMRI data provides new results that corroborate the modular processing hypothesis of the brain at rest as well as contributing new evidence of functional overlap between and within dynamic brain state communities. Our analysis pipeline provides a way to characterise dynamic network activity of the brain under novel behaviours or conditions.  ( 3 min )
    Recurrent Neural Networks for Forecasting Time Series with Multiple Seasonality: A Comparative Study. (arXiv:2203.09170v1 [cs.LG])
    This paper compares recurrent neural networks (RNNs) with different types of gated cells for forecasting time series with multiple seasonality. The cells we compare include classical long short term memory (LSTM), gated recurrent unit (GRU), modified LSTM with dilation, and two new cells we proposed recently, which are equipped with dilation and attention mechanisms. To model the temporal dependencies of different scales, our RNN architecture has multiple dilated recurrent layers stacked with hierarchical dilations. The proposed RNN produces both point forecasts and predictive intervals (PIs) for them. An empirical study concerning short-term electrical load forecasting for 35 European countries confirmed that the new gated cells with dilation and attention performed best.  ( 2 min )
    DetMatch: Two Teachers are Better Than One for Joint 2D and 3D Semi-Supervised Object Detection. (arXiv:2203.09510v1 [cs.CV])
    While numerous 3D detection works leverage the complementary relationship between RGB images and point clouds, developments in the broader framework of semi-supervised object recognition remain uninfluenced by multi-modal fusion. Current methods develop independent pipelines for 2D and 3D semi-supervised learning despite the availability of paired image and point cloud frames. Observing that the distinct characteristics of each sensor cause them to be biased towards detecting different objects, we propose DetMatch, a flexible framework for joint semi-supervised learning on 2D and 3D modalities. By identifying objects detected in both sensors, our pipeline generates a cleaner, more robust set of pseudo-labels that both demonstrates stronger performance and stymies single-modality error propagation. Further, we leverage the richer semantics of RGB images to rectify incorrect 3D class predictions and improve localization of 3D boxes. Evaluating on the challenging KITTI and Waymo datasets, we improve upon strong semi-supervised learning methods and observe higher quality pseudo-labels. Code will be released at https://github.com/Divadi/DetMatch  ( 2 min )
    Reachability analysis of neural networks using mixed monotonicity. (arXiv:2111.07683v2 [eess.SY] UPDATED)
    This paper presents a new reachability analysis approach to compute interval over-approximations of the output set of feedforward neural networks with input uncertainty. We adapt to neural networks an existing mixed-monotonicity method for the reachability analysis of dynamical systems and apply it to each partial network within the main network. This ensures that the intersection of the obtained results is the tightest interval over-approximation of the output of each layer that can be obtained using mixed-monotonicity on any partial network decomposition. Unlike other tools in the literature focusing on small classes of piecewise-affine or monotone activation functions, the main strength of our approach is its generality: it can handle neural networks with any Lipschitz-continuous activation function. In addition, the simplicity of our framework allows users to very easily add unimplemented activation functions, by simply providing the function, its derivative and the global argmin and argmax of the derivative. Our algorithm is compared to five other interval-based tools (Interval Bound Propagation, ReluVal, Neurify, VeriNet, CROWN) on both existing benchmarks and two sets of small and large randomly generated networks for four activation functions (ReLU, TanH, ELU, SiLU).  ( 2 min )
    Calibration of P-values for calibration and for deviation of a subpopulation from the full population. (arXiv:2202.00100v3 [stat.ME] UPDATED)
    The author's recent research papers, "Cumulative deviation of a subpopulation from the full population" and "A graphical method of cumulative differences between two subpopulations" (both published in volume 8 of Springer's open-access "Journal of Big Data" during 2021), propose graphical methods and summary statistics, without extensively calibrating formal significance tests. The summary metrics and methods can measure the calibration of probabilistic predictions and can assess differences in responses between a subpopulation and the full population while controlling for a covariate or score via conditioning on it. These recently published papers construct significance tests based on the scalar summary statistics, but only sketch how to calibrate the attained significance levels (also known as "P-values") for the tests. The present article reviews and synthesizes work spanning many decades in order to detail how to calibrate the P-values. The present paper presents computationally efficient, easily implemented numerical methods for evaluating properly calibrated P-values, together with rigorous mathematical proofs guaranteeing their accuracy, and illustrates and validates the methods with open-source software and numerical examples.  ( 2 min )
    Meta-Learning of NAS for Few-shot Learning in Medical Image Applications. (arXiv:2203.08951v1 [cs.LG])
    Deep learning methods have been successful in solving tasks in machine learning and have made breakthroughs in many sectors owing to their ability to automatically extract features from unstructured data. However, their performance relies on manual trial-and-error processes for selecting an appropriate network architecture, hyperparameters for training, and pre-/post-procedures. Even though it has been shown that network architecture plays a critical role in learning feature representation feature from data and the final performance, searching for the best network architecture is computationally intensive and heavily relies on researchers' experience. Automated machine learning (AutoML) and its advanced techniques i.e. Neural Architecture Search (NAS) have been promoted to address those limitations. Not only in general computer vision tasks, but NAS has also motivated various applications in multiple areas including medical imaging. In medical imaging, NAS has significant progress in improving the accuracy of image classification, segmentation, reconstruction, and more. However, NAS requires the availability of large annotated data, considerable computation resources, and pre-defined tasks. To address such limitations, meta-learning has been adopted in the scenarios of few-shot learning and multiple tasks. In this book chapter, we first present a brief review of NAS by discussing well-known approaches in search space, search strategy, and evaluation strategy. We then introduce various NAS approaches in medical imaging with different applications such as classification, segmentation, detection, reconstruction, etc. Meta-learning in NAS for few-shot learning and multiple tasks is then explained. Finally, we describe several open problems in NAS.  ( 2 min )
    Mixing Up Contrastive Learning: Self-Supervised Representation Learning for Time Series. (arXiv:2203.09270v1 [stat.ML])
    The lack of labeled data is a key challenge for learning useful representation from time series data. However, an unsupervised representation framework that is capable of producing high quality representations could be of great value. It is key to enabling transfer learning, which is especially beneficial for medical applications, where there is an abundance of data but labeling is costly and time consuming. We propose an unsupervised contrastive learning framework that is motivated from the perspective of label smoothing. The proposed approach uses a novel contrastive loss that naturally exploits a data augmentation scheme in which new samples are generated by mixing two data samples with a mixing component. The task in the proposed framework is to predict the mixing component, which is utilized as soft targets in the loss function. Experiments demonstrate the framework's superior performance compared to other representation learning approaches on both univariate and multivariate time series and illustrate its benefits for transfer learning for clinical time series.  ( 2 min )
    Topological Graph Neural Networks. (arXiv:2102.07835v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) are a powerful architecture for tackling graph learning tasks, yet have been shown to be oblivious to eminent substructures such as cycles. We present TOGL, a novel layer that incorporates global topological information of a graph using persistent homology. TOGL can be easily integrated into any type of GNN and is strictly more expressive (in terms the Weisfeiler--Lehman graph isomorphism test) than message-passing GNNs. Augmenting GNNs with TOGL leads to improved predictive performance for graph and node classification tasks, both on synthetic data sets, which can be classified by humans using their topology but not by ordinary GNNs, and on real-world data.  ( 2 min )
    Adversarial Support Alignment. (arXiv:2203.08908v1 [cs.LG])
    We study the problem of aligning the supports of distributions. Compared to the existing work on distribution alignment, support alignment does not require the densities to be matched. We propose symmetric support difference as a divergence measure to quantify the mismatch between supports. We show that select discriminators (e.g. discriminator trained for Jensen-Shannon divergence) are able to map support differences as support differences in their one-dimensional output space. Following this result, our method aligns supports by minimizing a symmetrized relaxed optimal transport cost in the discriminator 1D space via an adversarial process. Furthermore, we show that our approach can be viewed as a limit of existing notions of alignment by increasing transportation assignment tolerance. We quantitatively evaluate the method across domain adaptation tasks with shifts in label distributions. Our experiments show that the proposed method is more robust against these shifts than other alignment-based baselines.
    Pure Exploration in Kernel and Neural Bandits. (arXiv:2106.12034v2 [stat.ML] UPDATED)
    We study pure exploration in bandits, where the dimension of the feature representation can be much larger than the number of arms. To overcome the curse of dimensionality, we propose to adaptively embed the feature representation of each arm into a lower-dimensional space and carefully deal with the induced model misspecification. Our approach is conceptually very different from existing works that can either only handle low-dimensional linear bandits or passively deal with model misspecification. We showcase the application of our approach to two pure exploration settings that were previously under-studied: (1) the reward function belongs to a possibly infinite-dimensional Reproducing Kernel Hilbert Space, and (2) the reward function is nonlinear and can be approximated by neural networks. Our main results provide sample complexity guarantees that only depend on the effective dimension of the feature spaces in the kernel or neural representations. Extensive experiments conducted on both synthetic and real-world datasets demonstrate the efficacy of our methods.  ( 2 min )
    Prediction of speech intelligibility with DNN-based performance measures. (arXiv:2203.09148v1 [cs.SD])
    This paper presents a speech intelligibility model based on automatic speech recognition (ASR), combining phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities. This model does not require the clean speech reference nor the word labels during testing as the ASR decoding step, which finds the most likely sequence of words given phoneme posterior probabilities, is omitted. The model is evaluated via the root-mean-squared error between the predicted and observed speech reception thresholds from eight normal-hearing listeners. The recognition task consists of identifying noisy words from a German matrix sentence test. The speech material was mixed with eight noise maskers covering different modulation types, from speech-shaped stationary noise to a single-talker masker. The prediction performance is compared to five established models and an ASR-model using word labels. Two combinations of features and networks were tested. Both include temporal information either at the feature level (amplitude modulation filterbanks and a feed-forward network) or captured by the architecture (mel-spectrograms and a time-delay deep neural network, TDNN). The TDNN model is on par with the DNN while reducing the number of parameters by a factor of 37; this optimization allows parallel streams on dedicated hearing aid hardware as a forward-pass can be computed within the 10ms of each frame. The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.  ( 2 min )
    Intrinsic Neural Fields: Learning Functions on Manifolds. (arXiv:2203.07967v2 [cs.CV] UPDATED)
    Neural fields have gained significant attention in the computer vision community due to their excellent performance in novel view synthesis, geometry reconstruction, and generative modeling. Some of their advantages are a sound theoretic foundation and an easy implementation in current deep learning frameworks. While neural fields have been applied to signals on manifolds, e.g., for texture reconstruction, their representation has been limited to extrinsically embedding the shape into Euclidean space. The extrinsic embedding ignores known intrinsic manifold properties and is inflexible wrt. transfer of the learned function. To overcome these limitations, this work introduces intrinsic neural fields, a novel and versatile representation for neural fields on manifolds. Intrinsic neural fields combine the advantages of neural fields with the spectral properties of the Laplace-Beltrami operator. We show theoretically that intrinsic neural fields inherit many desirable properties of the extrinsic neural field framework but exhibit additional intrinsic qualities, like isometry invariance. In experiments, we show intrinsic neural fields can reconstruct high-fidelity textures from images with state-of-the-art quality and are robust to the discretization of the underlying manifold. We demonstrate the versatility of intrinsic neural fields by tackling various applications: texture transfer between deformed shapes & different shapes, texture reconstruction from real-world images with view dependence, and discretization-agnostic learning on meshes and point clouds.  ( 2 min )
    Toward the Detection of Polyglot Files. (arXiv:2203.07561v2 [cs.CR] UPDATED)
    Standardized file formats play a key role in the development and use of computer software. However, it is possible to abuse standardized file formats by creating a file that is valid in multiple file formats. The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis.This is especially problematic for malware detection systems that rely on file format identification for feature extraction. File format identification processes that depend on file signatures can be easily evaded thanks to flexibility in the format specifications of certain file formats. Although work has been done to identify file formats using more comprehensive methods than file signatures, accurate identification of polyglot files remains an open problem. Since malware detection systems routinely perform file format-specific feature extraction, polyglot files need to be filtered out prior to ingestion by these systems. Otherwise, malicious content could pass through undetected. To address the problem of polyglot detection we assembled a data set using the mitra tool. We then evaluated the performance of the most commonly used file identification tool, file. Finally, we demonstrated the accuracy, precision, recall and F1 score of a range of machine and deep learning models. Malconv2 and Catboost demonstrated the highest recall on our data set with 95.16% and 95.34%, respectively. These models can be incorporated into a malware detector's file processing pipeline to filter out potentially malicious polyglots before file format-dependent feature extraction takes place.  ( 2 min )
    Diffusion Probabilistic Modeling for Video Generation. (arXiv:2203.09481v1 [cs.CV])
    Denoising diffusion probabilistic models are a promising new class of generative models that are competitive with GANs on perceptual metrics. In this paper, we explore their potential for sequentially generating video. Inspired by recent advances in neural video compression, we use denoising diffusion models to stochastically generate a residual to a deterministic next-frame prediction. We compare this approach to two sequential VAE and two GAN baselines on four datasets, where we test the generated frames for perceptual quality and forecasting accuracy against ground truth frames. We find significant improvements in terms of perceptual quality on all data and improvements in terms of frame forecasting for complex high-resolution videos.  ( 2 min )
    Context-Dependent Anomaly Detection with Knowledge Graph Embedding Models. (arXiv:2203.09354v1 [cs.LG])
    Increasing the semantic understanding and contextual awareness of machine learning models is important for improving robustness and reducing susceptibility to data shifts. In this work, we leverage contextual awareness for the anomaly detection problem. Although graphed-based anomaly detection has been widely studied, context-dependent anomaly detection is an open problem and without much current research. We develop a general framework for converting a context-dependent anomaly detection problem to a link prediction problem, allowing well-established techniques from this domain to be applied. We implement a system based on our framework that utilizes knowledge graph embedding models and demonstrates the ability to detect outliers using context provided by a semantic knowledge base. We show that our method can detect context-dependent anomalies with a high degree of accuracy and show that current object detectors can detect enough classes to provide the needed context for good performance within our example domain.  ( 2 min )
    Error estimates for physics informed neural networks approximating the Navier-Stokes equations. (arXiv:2203.09346v1 [math.NA])
    We prove rigorous bounds on the errors resulting from the approximation of the incompressible Navier-Stokes equations with (extended) physics informed neural networks. We show that the underlying PDE residual can be made arbitrarily small for tanh neural networks with two hidden layers. Moreover, the total error can be estimated in terms of the training error, network size and number of quadrature points. The theory is illustrated with numerical experiments.  ( 2 min )
    A Decomposition-Based Hybrid Ensemble CNN Framework for Improving Cross-Subject EEG Decoding Performance. (arXiv:2203.09477v1 [eess.SP])
    Electroencephalogram (EEG) signals are complex, non-linear, and non-stationary in nature. However, previous studies that applied decomposition to minimize the complexity mainly exploited the hand-engineering features, limiting the information learned in EEG decoding. Therefore, extracting additional primary features from different disassembled components to improve the EEG-based recognition performance remains challenging. On the other hand, attempts have been made to use a single model to learn the hand-engineering features. Less work has been done to improve the generalization ability through ensemble learning. In this work, we propose a novel decomposition-based hybrid ensemble convolutional neural network (CNN) framework to enhance the capability of decoding EEG signals. CNNs, in particular, automatically learn the primary features from raw disassembled components but not handcraft features. The first option is to fuse the obtained score before the Softmax layer and execute back-propagation on the entire ensemble network, whereas the other is to fuse the probability output of the Softmax layer. Moreover, a component-specific batch normalization (CSBN) layer is employed to reduce subject variability. Against the challenging cross-subject driver fatigue-related situation awareness (SA) recognition task, eight models are proposed under the framework, which all showed superior performance than the strong baselines. The performance of different decomposition methods and ensemble modes were further compared. Results indicated that discrete wavelet transform (DWT)-based ensemble CNN achieves the best 82.11% among the proposed models. Our framework can be simply extended to any CNN architecture and applied in any EEG-related sectors, opening the possibility of extracting more preliminary information from complex EEG data.  ( 2 min )
    Personalized Federated Learning through Local Memorization. (arXiv:2111.09360v2 [cs.LG] UPDATED)
    Federated learning allows clients to collaboratively learn statistical models while keeping their data local. Federated learning was originally used to train a unique global model to be served to all clients, but this approach might be sub-optimal when clients' local data distributions are heterogeneous. In order to tackle this limitation, recent personalized federated learning methods train a separate model for each client while still leveraging the knowledge available at other clients. In this work, we exploit the ability of deep neural networks to extract high quality vectorial representations (embeddings) from non-tabular data, e.g., images and text, to propose a personalization mechanism based on local memorization. Personalization is obtained interpolating a pre-trained global model with a $k$-nearest neighbors (kNN) model based on the shared representation provided by the global model. We provide generalization bounds for the proposed approach and we show on a suite of federated datasets that this approach achieves significantly higher accuracy and fairness than state-of-the-art methods.  ( 2 min )
    Counterfactual Inference of Second Opinions. (arXiv:2203.08653v1 [cs.LG] CROSS LISTED)
    Automated decision support systems that are able to infer second opinions from experts can potentially facilitate a more efficient allocation of resources; they can help decide when and from whom to seek a second opinion. In this paper, we look at the design of this type of support systems from the perspective of counterfactual inference. We focus on a multiclass classification setting and first show that, if experts make predictions on their own, the underlying causal mechanism generating their predictions needs to satisfy a desirable set invariant property. Further, we show that, for any causal mechanism satisfying this property, there exists an equivalent mechanism where the predictions by each expert are generated by independent sub-mechanisms governed by a common noise. This motivates the design of a set invariant Gumbel-Max structural causal model where the structure of the noise governing the sub-mechanisms underpinning the model depends on an intuitive notion of similarity between experts which can be estimated from data. Experiments on both synthetic and real data show that our model can be used to infer second opinions more accurately than its non-causal counterpart.  ( 2 min )
    Out-of-distribution Generalization in the Presence of Nuisance-Induced Spurious Correlations. (arXiv:2107.00520v4 [cs.LG] UPDATED)
    In many prediction problems, spurious correlations are induced by a changing relationship between the label and a nuisance variable that is also correlated with the covariates. For example, in classifying animals in natural images, the background, which is a nuisance, can predict the type of animal. This nuisance-label relationship does not always hold, and the performance of a model trained under one such relationship may be poor on data with a different nuisance-label relationship. To build predictive models that perform well regardless of the nuisance-label relationship, we develop Nuisance-Randomized Distillation (NURD). We introduce the nuisance-randomized distribution, a distribution where the nuisance and the label are independent. Under this distribution, we define the set of representations such that conditioning on any member, the nuisance and the label remain independent. We prove that the representations in this set always perform better than chance, while representations outside of this set may not. NURD finds a representation from this set that is most informative of the label under the nuisance-randomized distribution, and we prove that this representation achieves the highest performance regardless of the nuisance-label relationship. We evaluate NURD on several tasks including chest X-ray classification where, using non-lung patches as the nuisance, NURD produces models that predict pneumonia under strong spurious correlations.  ( 2 min )
    Augmented Sliced Wasserstein Distances. (arXiv:2006.08812v7 [cs.LG] UPDATED)
    While theoretically appealing, the application of the Wasserstein distance to large-scale machine learning problems has been hampered by its prohibitive computational cost. The sliced Wasserstein distance and its variants improve the computational efficiency through the random projection, yet they suffer from low accuracy if the number of projections is not sufficiently large, because the majority of projections result in trivially small values. In this work, we propose a new family of distance metrics, called augmented sliced Wasserstein distances (ASWDs), constructed by first mapping samples to higher-dimensional hypersurfaces parameterized by neural networks. It is derived from a key observation that (random) linear projections of samples residing on these hypersurfaces would translate to much more flexible nonlinear projections in the original sample space, so they can capture complex structures of the data distribution. We show that the hypersurfaces can be optimized by gradient ascent efficiently. We provide the condition under which the ASWD is a valid metric and show that this can be obtained by an injective neural network architecture. Numerical results demonstrate that the ASWD significantly outperforms other Wasserstein variants for both synthetic and real-world problems.  ( 2 min )
    Markov Decision Process modeled with Bandits for Sequential Decision Making in Linear-flow. (arXiv:2107.00204v2 [cs.LG] UPDATED)
    For marketing, we sometimes need to recommend content for multiple pages in sequence. Different from general sequential decision making process, the use cases have a simpler flow where customers per seeing recommended content on each page can only return feedback as moving forward in the process or dropping from it until a termination state. We refer to this type of problems as sequential decision making in linear--flow. We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix. At recommendation time, we use Thompson sampling (TS) to sample the transition probabilities and allocate the best series of actions with analytical solution through exact dynamic programming. The way that we formulate the problem allows us to leverage TS's efficiency in balancing exploration and exploitation and Bandit's convenience in modeling actions' incompatibility. In the simulation study, we observe the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits. We also find the proposed algorithm's performance is the most robust to changes in the across-page interdependence strength.  ( 2 min )
    The Frost Hollow Experiments: Pavlovian Signalling as a Path to Coordination and Communication Between Agents. (arXiv:2203.09498v1 [cs.AI])
    Learned communication between agents is a powerful tool when approaching decision-making problems that are hard to overcome by any single agent in isolation. However, continual coordination and communication learning between machine agents or human-machine partnerships remains a challenging open problem. As a stepping stone toward solving the continual communication learning problem, in this paper we contribute a multi-faceted study into what we term Pavlovian signalling -- a process by which learned, temporally extended predictions made by one agent inform decision-making by another agent with different perceptual access to their shared environment. We seek to establish how different temporal processes and representational choices impact Pavlovian signalling between learning agents. To do so, we introduce a partially observable decision-making domain we call the Frost Hollow. In this domain a prediction learning agent and a reinforcement learning agent are coupled into a two-part decision-making system that seeks to acquire sparse reward while avoiding time-conditional hazards. We evaluate two domain variations: 1) machine prediction and control learning in a linear walk, and 2) a prediction learning machine interacting with a human participant in a virtual reality environment. Our results showcase the speed of learning for Pavlovian signalling, the impact that different temporal representations do (and do not) have on agent-agent coordination, and how temporal aliasing impacts agent-agent and human-agent interactions differently. As a main contribution, we establish Pavlovian signalling as a natural bridge between fixed signalling paradigms and fully adaptive communication learning. Our results therefore point to an actionable, constructivist path towards continual communication learning between reinforcement learning agents, with potential impact in a range of real-world settings.  ( 3 min )
    Uncertainty with UAV Search of Multiple Goal-oriented Targets. (arXiv:2203.09476v1 [cs.RO])
    This paper considers the complex problem of a team of UAVs searching targets under uncertainty. The goal of the UAV team is to find all of the moving targets as quickly as possible before they arrive at their selected goal. The uncertainty considered is threefold: First, the UAVs do not know the targets' locations and destinations. Second, the sensing capabilities of the UAVs are not perfect. Third, the targets' movement model is unknown. We suggest a real-time algorithmic framework for the UAVs, combining entropy and stochastic-temporal belief, that aims at optimizing the probability of a quick and successful detection of all of the targets. We have empirically evaluated the algorithmic framework, and have shown its efficiency and significant performance improvement compared to other solutions. Furthermore, we have evaluated our framework using Peer Designed Agents (PDAs), which are computer agents that simulate targets and show that our algorithmic framework outperforms other solutions in this scenario.  ( 2 min )
    Adaptive Graph Auto-Encoder for General Data Clustering. (arXiv:2002.08648v5 [cs.LG] UPDATED)
    Graph-based clustering plays an important role in the clustering area. Recent studies about graph convolution neural networks have achieved impressive success on graph type data. However, in general clustering tasks, the graph structure of data does not exist such that the strategy to construct a graph is crucial for performance. Therefore, how to extend graph convolution networks into general clustering tasks is an attractive problem. In this paper, we propose a graph auto-encoder for general data clustering, which constructs the graph adaptively according to the generative perspective of graphs. The adaptive process is designed to induce the model to exploit the high-level information behind data and utilize the non-Euclidean structure sufficiently. We further design a novel mechanism with rigorous analysis to avoid the collapse caused by the adaptive construction. Via combining the generative model for network embedding and graph-based clustering, a graph auto-encoder with a novel decoder is developed such that it performs well in weighted graph used scenarios. Extensive experiments prove the superiority of our model.  ( 2 min )
    FourierMask: Instance Segmentation using Fourier Mapping in Implicit Neural Networks. (arXiv:2112.12535v2 [cs.CV] UPDATED)
    We present FourierMask, which employs Fourier series combined with implicit neural representations to generate instance segmentation masks. We apply a Fourier mapping (FM) to the coordinate locations and utilize the mapped features as inputs to an implicit representation (coordinate-based multi-layer perceptron (MLP)). FourierMask learns to predict the coefficients of the FM for a particular instance, and therefore adapts the FM to a specific object. This allows FourierMask to be generalized to predict instance segmentation masks from natural images. Since implicit functions are continuous in the domain of input coordinates, we illustrate that by sub-sampling the input pixel coordinates, we can generate higher resolution masks during inference. Furthermore, we train a renderer MLP (FourierRend) on the uncertain predictions of FourierMask and illustrate that it significantly improves the quality of the masks. FourierMask shows competitive results on the MS COCO dataset compared to the baseline Mask R-CNN at the same output resolution and surpasses it on higher resolution.  ( 2 min )
    GATE: Graph CCA for Temporal SElf-supervised Learning for Label-efficient fMRI Analysis. (arXiv:2203.09034v1 [cs.LG])
    In this work, we focus on the challenging task, neuro-disease classification, using functional magnetic resonance imaging (fMRI). In population graph-based disease analysis, graph convolutional neural networks (GCNs) have achieved remarkable success. However, these achievements are inseparable from abundant labeled data and sensitive to spurious signals. To improve fMRI representation learning and classification under a label-efficient setting, we propose a novel and theory-driven self-supervised learning (SSL) framework on GCNs, namely Graph CCA for Temporal self-supervised learning on fMRI analysis GATE. Concretely, it is demanding to design a suitable and effective SSL strategy to extract formation and robust features for fMRI. To this end, we investigate several new graph augmentation strategies from fMRI dynamic functional connectives (FC) for SSL training. Further, we leverage canonical-correlation analysis (CCA) on different temporal embeddings and present the theoretical implications. Consequently, this yields a novel two-step GCN learning procedure comprised of (i) SSL on an unlabeled fMRI population graph and (ii) fine-tuning on a small labeled fMRI dataset for a classification task. Our method is tested on two independent fMRI datasets, demonstrating superior performance on autism and dementia diagnosis.  ( 2 min )
    Towards Understanding Asynchronous Advantage Actor-critic: Convergence and Linear Speedup. (arXiv:2012.15511v2 [cs.LG] UPDATED)
    Asynchronous and parallel implementation of standard reinforcement learning (RL) algorithms is a key enabler of the tremendous success of modern RL. Among many asynchronous RL algorithms, arguably the most popular and effective one is the asynchronous advantage actor-critic (A3C) algorithm. Although A3C is becoming the workhorse of RL, its theoretical properties are still not well-understood, including its non-asymptotic analysis and the performance gain of parallelism (a.k.a. linear speedup). This paper revisits the A3C algorithm and establishes its non-asymptotic convergence guarantees. Under both i.i.d. and Markovian sampling, we establish the local convergence guarantee for A3C in the general policy approximation case and the global convergence guarantee in softmax policy parameterization. Under i.i.d. sampling, A3C obtains sample complexity of $\mathcal{O}(\epsilon^{-2.5}/N)$ per worker to achieve $\epsilon$ accuracy, where $N$ is the number of workers. Compared to the best-known sample complexity of $\mathcal{O}(\epsilon^{-2.5})$ for two-timescale AC, A3C achieves \emph{linear speedup}, which justifies the advantage of parallelism and asynchrony in AC algorithms theoretically for the first time. Numerical tests on synthetic environment, OpenAI Gym environments and Atari games have been provided to verify our theoretical analysis.  ( 2 min )
    Bandit Labor Training. (arXiv:2006.06853v5 [cs.LG] UPDATED)
    On-demand labor platforms aim to train a skilled workforce to serve its incoming demand for jobs. Since limited jobs are available for training, and it is usually not necessary to train all workers, efficient matching of training jobs requires prioritizing fast learners over slow ones. However, the learning rates of novice workers are unknown, resulting in a tradeoff between exploration (learning the learning rates) and exploitation (training the best workers). Motivated to study this tradeoff, we analyze a novel objective within the stochastic multi-armed bandit framework. Given $K$ arms, instead of maximizing the expected total reward from $T$ pulls (the traditional "sum" objective), we consider the vector of cumulative rewards earned from the $K$ arms at the end of $T$ pulls and aim to maximize the expected highest cumulative reward (the "max" objective). When rewards represent skill increments, this corresponds to the objective of training a single highly skilled worker from a set of novice workers, using a limited supply of training jobs. For this objective, we show that any policy must incur an instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher instance-dependent constant) and a worst-case regret of $\Omega(K^{1/3}T^{2/3})$. We then design an explore-then-commit policy featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and achieves these bounds (up to logarithmic factors). We generalize our algorithmic insights to the problem of maximizing the expected value of the average cumulative reward of the top $m$ arms with the highest cumulative rewards, corresponding to the case where multiple workers must be trained. Our numerical experiments demonstrate the efficacy of our policies compared to several natural alternatives in practical parameter regimes.  ( 3 min )
    Image Super-Resolution With Deep Variational Autoencoders. (arXiv:2203.09445v1 [cs.CV])
    Image super-resolution (SR) techniques are used to generate a high-resolution image from a low-resolution image. Until now, deep generative models such as autoregressive models and Generative Adversarial Networks (GANs) have proven to be effective at modelling high-resolution images. Models based on Variational Autoencoders (VAEs) have often been criticized for their feeble generative performance, but with new advancements such as VDVAE (very deep VAE), there is now strong evidence that deep VAEs have the potential to outperform current state-of-the-art models for high-resolution image generation. In this paper, we introduce VDVAE-SR, a new model that aims to exploit the most recent deep VAE methodologies to improve upon image super-resolution using transfer learning on pretrained VDVAEs. Through qualitative and quantitative evaluations, we show that the proposed model is competitive with other state-of-the-art methods.  ( 2 min )
    The Structured Abstain Problem and the Lov\'asz Hinge. (arXiv:2203.08645v2 [cs.LG] UPDATED)
    The Lov\'asz hinge is a convex surrogate recently proposed for structured binary classification, in which $k$ binary predictions are made simultaneously and the error is judged by a submodular set function. Despite its wide usage in image segmentation and related problems, its consistency has remained open. We resolve this open question, showing that the Lov\'asz hinge is inconsistent for its desired target unless the set function is modular. Leveraging a recent embedding framework, we instead derive the target loss for which the Lov\'asz hinge is consistent. This target, which we call the structured abstain problem, allows one to abstain on any subset of the $k$ predictions. We derive two link functions, each of which are consistent for all submodular set functions simultaneously.  ( 2 min )
    SC2: Supervised Compression for Split Computing. (arXiv:2203.08875v1 [cs.LG])
    Split computing distributes the execution of a neural network (e.g., for a classification task) between a mobile device and a more powerful edge server. A simple alternative to splitting the network is to carry out the supervised task purely on the edge server while compressing and transmitting the full data, and most approaches have barely outperformed this baseline. This paper proposes a new approach for discretizing and entropy-coding intermediate feature activations to efficiently transmit them from the mobile device to the edge server. We show that a efficient splittable network architecture results from a three-way tradeoff between (a) minimizing the computation on the mobile device, (b) minimizing the size of the data to be transmitted, and (c) maximizing the model's prediction performance. We propose an architecture based on this tradeoff and train the splittable network and entropy model in a knowledge distillation framework. In an extensive set of experiments involving three vision tasks, three datasets, nine baselines, and more than 180 trained models, we show that our approach improves supervised rate-distortion tradeoffs while maintaining a considerably smaller encoder size. We also release sc2bench, an installable Python package, to encourage and facilitate future studies on supervised compression for split computing (SC2).  ( 2 min )
    Dimensionality Reduction and Wasserstein Stability for Kernel Regression. (arXiv:2203.09347v1 [stat.ML])
    In a high-dimensional regression framework, we study consequences of the naive two-step procedure where first the dimension of the input variables is reduced and second, the reduced input variables are used to predict the output variable. More specifically we combine principal component analysis (PCA) with kernel regression. In order to analyze the resulting regression errors, a novel stability result of kernel regression with respect to the Wasserstein distance is derived. This allows us to bound errors that occur when perturbed input data is used to fit a kernel function. We combine the stability result with known estimates from the literature on both principal component analysis and kernel regression to obtain convergence rates for the two-step procedure.  ( 2 min )
    Covid19 Reproduction Number: Credibility Intervals by Blockwise Proximal Monte Carlo Samplers. (arXiv:2203.09142v1 [cs.LG])
    Monitoring the Covid19 pandemic constitutes a critical societal stake that received considerable research efforts. The intensity of the pandemic on a given territory is efficiently measured by the reproduction number, quantifying the rate of growth of daily new infections. Recently, estimates for the time evolution of the reproduction number were produced using an inverse problem formulation with a nonsmooth functional minimization. While it was designed to be robust to the limited quality of the Covid19 data (outliers, missing counts), the procedure lacks the ability to output credibility interval based estimates. This remains a severe limitation for practical use in actual pandemic monitoring by epidemiologists that the present work aims to overcome by use of Monte Carlo sampling. After interpretation of the functional into a Bayesian framework, several sampling schemes are tailored to adjust the nonsmooth nature of the resulting posterior distribution. The originality of the devised algorithms stems from combining a Langevin Monte Carlo sampling scheme with Proximal operators. Performance of the new algorithms in producing relevant credibility intervals for the reproduction number estimates and denoised counts are compared. Assessment is conducted on real daily new infection counts made available by the Johns Hopkins University. The interest of the devised monitoring tools are illustrated on Covid19 data from several different countries.  ( 2 min )
    Training Structured Neural Networks Through Manifold Identification and Variance Reduction. (arXiv:2112.02612v2 [cs.LG] UPDATED)
    This paper proposes an algorithm (RMDA) for training neural networks (NNs) with a regularization term for promoting desired structures. RMDA does not incur computation additional to proximal SGD with momentum, and achieves variance reduction without requiring the objective function to be of the finite-sum form. Through the tool of manifold identification from nonlinear optimization, we prove that after a finite number of iterations, all iterates of RMDA possess a desired structure identical to that induced by the regularizer at the stationary point of asymptotic convergence, even in the presence of engineering tricks like data augmentation and dropout that complicate the training process. Experiments on training NNs with structured sparsity confirm that variance reduction is necessary for such an identification, and show that RMDA thus significantly outperforms existing methods for this task. For unstructured sparsity, RMDA also outperforms a state-of-the-art pruning method, validating the benefits of training structured NNs through regularization.  ( 2 min )
    Transframer: Arbitrary Frame Prediction with Generative Models. (arXiv:2203.09494v1 [cs.CV])
    We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction. Our approach unifies a broad range of tasks, from image segmentation, to novel view synthesis and video interpolation. We pair this framework with an architecture we term Transframer, which uses U-Net and Transformer components to condition on annotated context frames, and outputs sequences of sparse, compressed image features. Transframer is the state-of-the-art on a variety of video generation benchmarks, is competitive with the strongest models on few-shot view synthesis, and can generate coherent 30 second videos from a single image without any explicit geometric information. A single generalist Transframer simultaneously produces promising results on 8 tasks, including semantic segmentation, image classification and optical flow prediction with no task-specific architectural components, demonstrating that multi-task computer vision can be tackled using probabilistic image models. Our approach can in principle be applied to a wide range of applications that require learning the conditional structure of annotated image-formatted data.  ( 2 min )
    Generating Novel Scene Compositions from Single Images and Videos. (arXiv:2103.13389v3 [cs.CV] UPDATED)
    Given a large dataset for training, GANs can achieve remarkable performance for the image synthesis task. However, training GANs in extremely low data regimes remains a challenge, as overfitting often occurs, leading to memorization or training divergence. In this work, we introduce SIV-GAN, an unconditional generative model that can generate new scene compositions from a single training image or a single video clip. We propose a two-branch discriminator architecture, with content and layout branches designed to judge internal content and scene layout realism separately from each other. This discriminator design enables synthesis of visually plausible, novel compositions of a scene, with varying content and layout, while preserving the context of the original sample. Compared to previous single-image GANs, our model generates more diverse, higher quality images, while not being restricted to a single image setting. We show that SIV-GAN successfully deals with a new challenging task of learning from a single video, for which prior GAN models fail to achieve synthesis of both high quality and diversity.  ( 2 min )
    Towards Personalized Federated Learning. (arXiv:2103.00710v3 [cs.LG] UPDATED)
    In parallel with the rapid adoption of Artificial Intelligence (AI) empowered by advances in AI research, there have been growing awareness and concerns of data privacy. Recent significant developments in the data regulation landscape have prompted a seismic shift in interest towards privacy-preserving AI. This has contributed to the popularity of Federated Learning (FL), the leading paradigm for the training of machine learning models on data silos in a privacy-preserving manner. In this survey, we explore the domain of Personalized FL (PFL) to address the fundamental challenges of FL on heterogeneous data, a universal characteristic inherent in all real-world datasets. We analyze the key motivations for PFL and present a unique taxonomy of PFL techniques categorized according to the key challenges and personalization strategies in PFL. We highlight their key ideas, challenges and opportunities and envision promising future trajectories of research towards new PFL architectural design, realistic PFL benchmarking, and trustworthy PFL approaches.  ( 2 min )
    Explainability in Graph Neural Networks: An Experimental Survey. (arXiv:2203.09258v1 [cs.LG])
    Graph neural networks (GNNs) have been extensively developed for graph representation learning in various application domains. However, similar to all other neural networks models, GNNs suffer from the black-box problem as people cannot understand the mechanism underlying them. To solve this problem, several GNN explainability methods have been proposed to explain the decisions made by GNNs. In this survey, we give an overview of the state-of-the-art GNN explainability methods and how they are evaluated. Furthermore, we propose a new evaluation metric and conduct thorough experiments to compare GNN explainability methods on real world datasets. We also suggest future directions for GNN explainability.  ( 2 min )
    The Mathematics of Artificial Intelligence. (arXiv:2203.08890v1 [cs.LG])
    We currently witness the spectacular success of artificial intelligence in both science and public life. However, the development of a rigorous mathematical foundation is still at an early stage. In this survey article, which is based on an invited lecture at the International Congress of Mathematicians 2022, we will in particular focus on the current "workhorse" of artificial intelligence, namely deep neural networks. We will present the main theoretical directions along with several exemplary results and discuss key open problems.  ( 2 min )
    On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers. (arXiv:2203.08958v1 [cs.LG])
    Every uncalibrated classifier has a corresponding true calibration map that calibrates its confidence. Deviations of this idealistic map from the identity map reveal miscalibration. Such calibration errors can be reduced with many post-hoc calibration methods which fit some family of calibration maps on a validation dataset. In contrast, evaluation of calibration with the expected calibration error (ECE) on the test set does not explicitly involve fitting. However, as we demonstrate, ECE can still be viewed as if fitting a family of functions on the test data. This motivates the fit-on-the-test view on evaluation: first, approximate a calibration map on the test data, and second, quantify its distance from the identity. Exploiting this view allows us to unlock missed opportunities: (1) use the plethora of post-hoc calibration methods for evaluating calibration; (2) tune the number of bins in ECE with cross-validation. Furthermore, we introduce: (3) benchmarking on pseudo-real data where the true calibration map can be estimated very precisely; and (4) novel calibration and evaluation methods using new calibration map families PL and PL3.  ( 2 min )
    Self-Normalized Density Map (SNDM) for Counting Microbiological Objects. (arXiv:2203.09474v1 [cs.CV])
    The statistical properties of the density map (DM) approach to counting microbiological objects on images are studied in detail. The DM is given by U$^2$-Net. Two statistical methods for deep neural networks are utilized: the bootstrap and the Monte Carlo (MC) dropout. The detailed analysis of the uncertainties for the DM predictions leads to a deeper understanding of the DM model's deficiencies. Based on our investigation, we propose a self-normalization module in the network. The improved network model, called Self-Normalized Density Map (SNDM), can correct its output density map by itself to accurately predict the total number of objects in the image. The SNDM architecture outperforms the original model. Moreover, both statistical frameworks -- bootstrap and MC dropout -- have consistent statistical results for SNDM, which were not observed in the original model.  ( 2 min )
    Equivariant Subgraph Aggregation Networks. (arXiv:2110.02910v3 [cs.LG] UPDATED)
    Message-passing neural networks (MPNNs) are the leading architecture for deep learning on graph-structured data, in large part due to their simplicity and scalability. Unfortunately, it was shown that these architectures are limited in their expressive power. This paper proposes a novel framework called Equivariant Subgraph Aggregation Networks (ESAN) to address this issue. Our main observation is that while two graphs may not be distinguishable by an MPNN, they often contain distinguishable subgraphs. Thus, we propose to represent each graph as a set of subgraphs derived by some predefined policy, and to process it using a suitable equivariant architecture. We develop novel variants of the 1-dimensional Weisfeiler-Leman (1-WL) test for graph isomorphism, and prove lower bounds on the expressiveness of ESAN in terms of these new WL variants. We further prove that our approach increases the expressive power of both MPNNs and more expressive architectures. Moreover, we provide theoretical results that describe how design choices such as the subgraph selection policy and equivariant neural architecture affect our architecture's expressive power. To deal with the increased computational cost, we propose a subgraph sampling scheme, which can be viewed as a stochastic version of our framework. A comprehensive set of experiments on real and synthetic datasets demonstrates that our framework improves the expressive power and overall performance of popular GNN architectures.  ( 2 min )
    Practical data monitoring in the internet-services domain. (arXiv:2203.08067v2 [cs.LG] UPDATED)
    Large-scale monitoring, anomaly detection, and root cause analysis of metrics are essential requirements of the internet-services industry. To address the need to continuously monitor millions of metrics, many anomaly detection approaches are being used on a daily basis by large internet-based companies. However, in spite of the significant progress made to accurately and efficiently detect anomalies in metrics, the sheer scale of the number of metrics has meant there are still a large number of false alarms that need to be investigated. This paper presents a framework for reliable large-scale anomaly detection. It is significantly more accurate than existing approaches and allows for easy interpretation of models, thus enabling practical data monitoring in the internet-services domain.  ( 2 min )
    Meta-Learning with Fewer Tasks through Task Interpolation. (arXiv:2106.02695v2 [cs.LG] UPDATED)
    Meta-learning enables algorithms to quickly learn a newly encountered task with just a few labeled examples by transferring previously learned knowledge. However, the bottleneck of current meta-learning algorithms is the requirement of a large number of meta-training tasks, which may not be accessible in real-world scenarios. To address the challenge that available tasks may not densely sample the space of tasks, we propose to augment the task set through interpolation. By meta-learning with task interpolation (MLTI), our approach effectively generates additional tasks by randomly sampling a pair of tasks and interpolating the corresponding features and labels. Under both gradient-based and metric-based meta-learning settings, our theoretical analysis shows MLTI corresponds to a data-adaptive meta-regularization and further improves the generalization. Empirically, in our experiments on eight datasets from diverse domains including image recognition, pose prediction, molecule property prediction, and medical image classification, we find that the proposed general MLTI framework is compatible with representative meta-learning algorithms and consistently outperforms other state-of-the-art strategies.  ( 2 min )
    Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining. (arXiv:2110.08450v2 [cs.LG] UPDATED)
    Improving the training and inference performance of graph neural networks (GNNs) is faced with a challenge uncommon in general neural networks: creating mini-batches requires a lot of computation and data movement due to the exponential growth of multi-hop graph neighborhoods along network layers. Such a unique challenge gives rise to a diverse set of system design choices. We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment, under which we identify major performance bottlenecks hitherto under-explored by developers: mini-batch preparation and transfer. We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler, a shared-memory parallelization strategy, and the pipelining of batch transfer with GPU computation. We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised. Such an observation unifies training and inference, simplifying model implementation. We report comprehensive experimental results with several benchmark data sets and GNN architectures, including a demonstration that, for the ogbn-papers100M data set, our system SALIENT achieves a speedup of 3x over a standard PyTorch-Geometric implementation with a single GPU and a further 8x parallel speedup with 16 GPUs. Therein, training a 3-layer GraphSAGE model with sampling fanout (15, 10, 5) takes 2.0 seconds per epoch and inference with fanout (20, 20, 20) takes 2.4 seconds, attaining test accuracy 64.58%.  ( 2 min )
    On the Spectral Bias of Convolutional Neural Tangent and Gaussian Process Kernels. (arXiv:2203.09255v1 [cs.LG])
    We study the properties of various over-parametrized convolutional neural architectures through their respective Gaussian process and neural tangent kernels. We prove that, with normalized multi-channel input and ReLU activation, the eigenfunctions of these kernels with the uniform measure are formed by products of spherical harmonics, defined over the channels of the different pixels. We next use hierarchical factorizable kernels to bound their respective eigenvalues. We show that the eigenvalues decay polynomially, quantify the rate of decay, and derive measures that reflect the composition of hierarchical features in these networks. Our results provide concrete quantitative characterization of over-parameterized convolutional network architectures.  ( 2 min )
    Robustness through Cognitive Dissociation Mitigation in Contrastive Adversarial Training. (arXiv:2203.08959v1 [cs.LG])
    In this paper, we introduce a novel neural network training framework that increases model's adversarial robustness to adversarial attacks while maintaining high clean accuracy by combining contrastive learning (CL) with adversarial training (AT). We propose to improve model robustness to adversarial attacks by learning feature representations that are consistent under both data augmentations and adversarial perturbations. We leverage contrastive learning to improve adversarial robustness by considering an adversarial example as another positive example, and aim to maximize the similarity between random augmentations of data samples and their adversarial example, while constantly updating the classification head in order to avoid a cognitive dissociation between the classification head and the embedding space. This dissociation is caused by the fact that CL updates the network up to the embedding space, while freezing the classification head which is used to generate new positive adversarial examples. We validate our method, Contrastive Learning with Adversarial Features(CLAF), on the CIFAR-10 dataset on which it outperforms both robust accuracy and clean accuracy over alternative supervised and self-supervised adversarial learning methods.  ( 2 min )
    A Novel Sleep Stage Classification Using CNN Generated by an Efficient Neural Architecture Search with a New Data Processing Trick. (arXiv:2110.15277v2 [eess.SP] UPDATED)
    With the development of automatic sleep stage classification (ASSC) techniques, many classical methods such as k-means, decision tree, and SVM have been used in automatic sleep stage classification. However, few methods explore deep learning on ASSC. Meanwhile, most deep learning methods require extensive expertise and suffer from a mass of handcrafted steps which are time-consuming especially when dealing with multi-classification tasks. In this paper, we propose an efficient five-sleep-stage classification method using convolutional neural networks (CNNs) with a novel data processing trick and we design neural architecture search (NAS) technique based on genetic algorithm (GA), NAS-G, to search for the best CNN architecture. Firstly, we attach each kernel with an adaptive coefficient to enhance the signal processing of the inputs. This can enhance the propagation of informative features and suppress the propagation of useless features in the early stage of the network. Then, we make full use of GA's heuristic search and the advantage of no need for the gradient to search for the best architecture of CNN. This can achieve a CNN with better performance than a handcrafted one in a large search space at the minimum cost. We verify the convergence of our data processing trick and compare the performance of traditional CNNs before and after using our trick. Meanwhile, we compare the performance between the CNN generated through NAS-G and the traditional CNNs with our trick. The experiments demonstrate that the convergence of CNNs with data processing trick is faster than without data processing trick and the CNN with data processing trick generated by NAS-G outperforms the handcrafted counterparts that use the data processing trick too.  ( 3 min )
    Symmetry-Based Representations for Artificial and Biological General Intelligence. (arXiv:2203.09250v1 [q-bio.NC])
    Biological intelligence is remarkable in its ability to produce complex behaviour in many diverse situations through data efficient, generalisable and transferable skill acquisition. It is believed that learning "good" sensory representations is important for enabling this, however there is little agreement as to what a good representation should look like. In this review article we are going to argue that symmetry transformations are a fundamental principle that can guide our search for what makes a good representation. The idea that there exist transformations (symmetries) that affect some aspects of the system but not others, and their relationship to conserved quantities has become central in modern physics, resulting in a more unified theoretical framework and even ability to predict the existence of new particles. Recently, symmetries have started to gain prominence in machine learning too, resulting in more data efficient and generalisable algorithms that can mimic some of the complex behaviours produced by biological intelligence. Finally, first demonstrations of the importance of symmetry transformations for representation learning in the brain are starting to arise in neuroscience. Taken together, the overwhelming positive effect that symmetries bring to these disciplines suggest that they may be an important general framework that determines the structure of the universe, constrains the nature of natural tasks and consequently shapes both biological and artificial intelligence.  ( 2 min )
    On Redundancy and Diversity in Cell-based Neural Architecture Search. (arXiv:2203.08887v1 [stat.ML])
    Searching for the architecture cells is a dominant paradigm in NAS. However, little attention has been devoted to the analysis of the cell-based search spaces even though it is highly important for the continual development of NAS. In this work, we conduct an empirical post-hoc analysis of architectures from the popular cell-based search spaces and find that the existing search spaces contain a high degree of redundancy: the architecture performance is minimally sensitive to changes at large parts of the cells, and universally adopted designs, like the explicit search for a reduction cell, significantly increase the complexities but have very limited impact on the performance. Across architectures found by a diverse set of search strategies, we consistently find that the parts of the cells that do matter for architecture performance often follow similar and simple patterns. By explicitly constraining cells to include these patterns, randomly sampled architectures can match or even outperform the state of the art. These findings cast doubts into our ability to discover truly novel architectures in the existing cell-based search spaces, and inspire our suggestions for improvement to guide future NAS research. Code is available at https://github.com/xingchenwan/cell-based-NAS-analysis.  ( 2 min )
    Continual Learning Based on OOD Detection and Task Masking. (arXiv:2203.09450v1 [cs.CV])
    Existing continual learning techniques focus on either task incremental learning (TIL) or class incremental learning (CIL) problem, but not both. CIL and TIL differ mainly in that the task-id is provided for each test sample during testing for TIL, but not provided for CIL. Continual learning methods intended for one problem have limitations on the other problem. This paper proposes a novel unified approach based on out-of-distribution (OOD) detection and task masking, called CLOM, to solve both problems. The key novelty is that each task is trained as an OOD detection model rather than a traditional supervised learning model, and a task mask is trained to protect each task to prevent forgetting. Our evaluation shows that CLOM outperforms existing state-of-the-art baselines by large margins. The average TIL/CIL accuracy of CLOM over six experiments is 87.6/67.9% while that of the best baselines is only 82.4/55.0%.  ( 2 min )
    Global Convergence of MAML and Theory-Inspired Neural Architecture Search for Few-Shot Learning. (arXiv:2203.09137v1 [cs.LG])
    Model-agnostic meta-learning (MAML) and its variants have become popular approaches for few-shot learning. However, due to the non-convexity of deep neural nets (DNNs) and the bi-level formulation of MAML, the theoretical properties of MAML with DNNs remain largely unknown. In this paper, we first prove that MAML with over-parameterized DNNs is guaranteed to converge to global optima at a linear rate. Our convergence analysis indicates that MAML with over-parameterized DNNs is equivalent to kernel regression with a novel class of kernels, which we name as Meta Neural Tangent Kernels (MetaNTK). Then, we propose MetaNTK-NAS, a new training-free neural architecture search (NAS) method for few-shot learning that uses MetaNTK to rank and select architectures. Empirically, we compare our MetaNTK-NAS with previous NAS methods on two popular few-shot learning benchmarks, miniImageNet, and tieredImageNet. We show that the performance of MetaNTK-NAS is comparable or better than the state-of-the-art NAS method designed for few-shot learning while enjoying more than 100x speedup. We believe the efficiency of MetaNTK-NAS makes itself more practical for many real-world tasks.  ( 2 min )
    Defending Against Adversarial Attack in ECG Classification with Adversarial Distillation Training. (arXiv:2203.09487v1 [eess.SP])
    In clinics, doctors rely on electrocardiograms (ECGs) to assess severe cardiac disorders. Owing to the development of technology and the increase in health awareness, ECG signals are currently obtained by using medical and commercial devices. Deep neural networks (DNNs) can be used to analyze these signals because of their high accuracy rate. However, researchers have found that adversarial attacks can significantly reduce the accuracy of DNNs. Studies have been conducted to defend ECG-based DNNs against traditional adversarial attacks, such as projected gradient descent (PGD), and smooth adversarial perturbation (SAP) which targets ECG classification; however, to the best of our knowledge, no study has completely explored the defense against adversarial attacks targeting ECG classification. Thus, we did different experiments to explore the effects of defense methods against white-box adversarial attack and black-box adversarial attack targeting ECG classification, and we found that some common defense methods performed well against these attacks. Besides, we proposed a new defense method called Adversarial Distillation Training (ADT) which comes from defensive distillation and can effectively improve the generalization performance of DNNs. The results show that our method performed more effectively against adversarial attacks targeting on ECG classification than the other baseline methods, namely, adversarial training, defensive distillation, Jacob regularization, and noise-to-signal ratio regularization. Furthermore, we found that our method performed better against PGD attacks with low noise levels, which means that our method has stronger robustness.  ( 2 min )
    $\ell_p$ Slack Norm Support Vector Data Description. (arXiv:2203.08932v1 [cs.LG])
    The support vector data description (SVDD) approach serves as a de facto standard for one-class classification where the learning task entails inferring the smallest hyper-sphere to enclose target objects while linearly penalising any errors/slacks via an $\ell_1$-norm penalty term. In this study, we generalise this modelling formalism to a general $\ell_p$-norm ($p\geq1$) slack penalty function. By virtue of an $\ell_p$ slack norm, the proposed approach enables formulating a non-linear cost function with respect to slacks. From a dual problem perspective, the proposed method introduces a sparsity-inducing dual norm into the objective function, and thus, possesses a higher capacity to tune into the inherent sparsity of the problem for enhanced descriptive capability. A theoretical analysis based on Rademacher complexities characterises the generalisation performance of the proposed approach in terms of parameter $p$ while the experimental results on several datasets confirm the merits of the proposed method compared to other alternatives.  ( 2 min )
    Kan Extensions in Data Science and Machine Learning. (arXiv:2203.09018v1 [cs.LG])
    A common problem in data science is "use this function defined over this small set to generate predictions over that larger set." Extrapolation, interpolation, statistical inference and forecasting all reduce to this problem. The Kan extension is a powerful tool in category theory that generalizes this notion. In this work we explore several applications of Kan extensions to data science. We begin by deriving a simple classification algorithm as a Kan extension and experimenting with this algorithm on real data. Next, we use the Kan extension to derive a procedure for learning clustering algorithms from labels and explore the performance of this procedure on real data. We then investigate how Kan extensions can be used to learn a general mapping from datasets of labeled examples to functions and to approximate a complex function with a simpler one.  ( 2 min )
    Playing with blocks: Toward re-usable deep learning models for side-channel profiled attacks. (arXiv:2203.08448v2 [cs.CR] UPDATED)
    This paper introduces a deep learning modular network for side-channel analysis. Our deep learning approach features the capability to exchange part of it (modules) with others networks. We aim to introduce reusable trained modules into side-channel analysis instead of building architectures for each evaluation, reducing the body of work when conducting those. Our experiments demonstrate that our architecture feasibly assesses a side-channel evaluation suggesting that learning transferability is possible with the network we propose in this paper.  ( 2 min )
    Stochastic and Private Nonconvex Outlier-Robust PCA. (arXiv:2203.09276v1 [cs.LG])
    We develop theoretically guaranteed stochastic methods for outlier-robust PCA. Outlier-robust PCA seeks an underlying low-dimensional linear subspace from a dataset that is corrupted with outliers. We are able to show that our methods, which involve stochastic geodesic gradient descent over the Grassmannian manifold, converge and recover an underlying subspace in various regimes through the development of a novel convergence analysis. The main application of this method is an effective differentially private algorithm for outlier-robust PCA that uses a Gaussian noise mechanism within the stochastic gradient method. Our results emphasize the advantages of the nonconvex methods over another convex approach to solving this problem in the differentially private setting. Experiments on synthetic and stylized data verify these results.  ( 2 min )
    How Low Can We Go: Trading Memory for Error in Low-Precision Training. (arXiv:2106.09686v3 [cs.LG] UPDATED)
    Low-precision arithmetic trains deep learning models using less energy, less memory and less time. However, we pay a price for the savings: lower precision may yield larger round-off error and hence larger prediction error. As applications proliferate, users must choose which precision to use to train a new model, and chip manufacturers must decide which precisions to manufacture. We view these precision choices as a hyperparameter tuning problem, and borrow ideas from meta-learning to learn the tradeoff between memory and error. In this paper, we introduce Pareto Estimation to Pick the Perfect Precision (PEPPP). We use matrix factorization to find non-dominated configurations (the Pareto frontier) with a limited number of network evaluations. For any given memory budget, the precision that minimizes error is a point on this frontier. Practitioners can use the frontier to trade memory for error and choose the best precision for their goals.  ( 2 min )
    MolNet: A Chemically Intuitive Graph Neural Network for Prediction of Molecular Properties. (arXiv:2203.09456v1 [physics.chem-ph])
    The graph neural network (GNN) has been a powerful deep-learning tool in chemistry domain, due to its close connection with molecular graphs. Most GNN models collect and update atom and molecule features from the fed atom (and, in some cases, bond) features, which are basically based on the two-dimensional (2D) graph representation of 3D molecules. Correspondingly, the adjacency matrix, containing the information on covalent bonds, or equivalent data structures (e.g., lists) have been the main core in the feature-updating processes, such as graph convolution. However, the 2D-based models do not faithfully represent 3D molecules and their physicochemical properties, exemplified by the overlooked field effect that is a "through-space" effect, not a "through-bond" effect. The GNN model proposed herein, denoted as MolNet, is chemically intuitive, accommodating the 3D non-bond information in a molecule, with a noncovalent adjacency matrix $\bf{\bar A}$, and also bond-strength information from a weighted bond matrix $\bf{B}$. The noncovalent atoms, not directly bonded to a given atom in a molecule, are identified within 5 $\unicode{x212B}$ of cut-off range for the construction of $\bf{\bar A}$, and $\bf{B}$ has edge weights of 1, 1.5, 2, and 3 for single, aromatic, double, and triple bonds, respectively. Comparative studies show that MolNet outperforms various baseline GNN models and gives a state-of-the-art performance in the classification task of BACE dataset and regression task of ESOL dataset. This work suggests a future direction of deep-learning chemistry in the construction of deep-learning models that are chemically intuitive and comparable with the existing chemistry concepts and tools.  ( 2 min )
    Visualizing Riemannian data with Rie-SNE. (arXiv:2203.09253v1 [cs.LG])
    Faithful visualizations of data residing on manifolds must take the underlying geometry into account when producing a flat planar view of the data. In this paper, we extend the classic stochastic neighbor embedding (SNE) algorithm to data on general Riemannian manifolds. We replace standard Gaussian assumptions with Riemannian diffusion counterparts and propose an efficient approximation that only requires access to calculations of Riemannian distances and volumes. We demonstrate that the approach also allows for mapping data from one manifold to another, e.g. from a high-dimensional sphere to a low-dimensional one.  ( 2 min )
    Memorizing Transformers. (arXiv:2203.08913v1 [cs.LG])
    Language models typically need to be trained or finetuned in order to acquire new knowledge, which involves updating their weights. We instead envision language models that can simply read and memorize new data at inference time, thus acquiring new knowledge immediately. In this work, we extend language models with the ability to memorize the internal representations of past inputs. We demonstrate that an approximate kNN lookup into a non-differentiable memory of recent (key, value) pairs improves language modeling across various benchmarks and tasks, including generic webtext (C4), math papers (arXiv), books (PG-19), code (Github), as well as formal theorems (Isabelle). We show that the performance steadily improves when we increase the size of memory up to 262K tokens. On benchmarks including code and mathematics, we find that the model is capable of making use of newly defined functions and theorems during test time.  ( 2 min )
    Symbolic Learning to Optimize: Towards Interpretability and Scalability. (arXiv:2203.06578v2 [cs.LG] UPDATED)
    Recent studies on Learning to Optimize (L2O) suggest a promising path to automating and accelerating the optimization procedure for complicated tasks. Existing L2O models parameterize optimization rules by neural networks, and learn those numerical rules via meta-training. However, they face two common pitfalls: (1) scalability: the numerical rules represented by neural networks create extra memory overhead for applying L2O models, and limit their applicability to optimizing larger tasks; (2) interpretability: it is unclear what an L2O model has learned in its black-box optimization rule, nor is it straightforward to compare different L2O models in an explainable way. To avoid both pitfalls, this paper proves the concept that we can "kill two birds by one stone", by introducing the powerful tool of symbolic regression to L2O. In this paper, we establish a holistic symbolic representation and analysis framework for L2O, which yields a series of insights for learnable optimizers. Leveraging our findings, we further propose a lightweight L2O model that can be meta-trained on large-scale problems and outperformed human-designed and tuned optimizers. Our work is set to supply a brand-new perspective to L2O research. Codes are available at: https://github.com/VITA-Group/Symbolic-Learning-To-Optimize.  ( 2 min )
    Excited state, non-adiabatic dynamics of large photoswitchable molecules using a chemically transferable machine learning potential. (arXiv:2108.04879v3 [physics.chem-ph] UPDATED)
    Light-induced chemical processes are ubiquitous in nature and have widespread technological applications. For example, photoisomerization can allow a drug with a photo-switchable scaffold such as azobenzene to be activated with light. In principle, photoswitches with desired photophysical properties like high isomerization quantum yields can be identified through virtual screening with reactive simulations. In practice, these simulations are rarely used for screening, since they require hundreds of trajectories and expensive quantum chemical methods to account for non-adiabatic excited state effects. Here we introduce a diabatic artificial neural network (DANN) based on diabatic states to accelerate such simulations for azobenzene derivatives. The network is six orders of magnitude faster than the quantum chemistry method used for training. DANN is transferable to azobenzene molecules outside the training set, predicting quantum yields for unseen species that are correlated with experiment. We use the model to virtually screen 3,100 hypothetical molecules, and identify novel species with extremely high predicted quantum yields. The model predictions are confirmed using high accuracy non-adiabatic dynamics. Our results pave the way for fast and accurate virtual screening of photoactive compounds.  ( 2 min )
    Extracting associations and meanings of objects depicted in artworks through bi-modal deep networks. (arXiv:2203.07026v2 [cs.CV] UPDATED)
    We present a novel bi-modal system based on deep networks to address the problem of learning associations and simple meanings of objects depicted in "authored" images, such as fine art paintings and drawings. Our overall system processes both the images and associated texts in order to learn associations between images of individual objects, their identities and the abstract meanings they signify. Unlike past deep nets that describe depicted objects and infer predicates, our system identifies meaning-bearing objects ("signifiers") and their associations ("signifieds") as well as basic overall meanings for target artworks. Our system had precision of 48% and recall of 78% with an F1 metric of 0.6 on a curated set of Dutch vanitas paintings, a genre celebrated for its concentration on conveying a meaning of great import at the time of their execution. We developed and tested our system on fine art paintings but our general methods can be applied to other authored images.  ( 2 min )
    Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs. (arXiv:2203.06717v2 [cs.CV] UPDATED)
    We revisit large kernel design in modern convolutional neural networks (CNNs). Inspired by recent advances of vision transformers (ViTs), in this paper, we demonstrate that using a few large convolutional kernels instead of a stack of small kernels could be a more powerful paradigm. We suggested five guidelines, e.g., applying re-parameterized large depth-wise convolutions, to design efficient high-performance large-kernel CNNs. Following the guidelines, we propose RepLKNet, a pure CNN architecture whose kernel size is as large as 31x31, in contrast to commonly used 3x3. RepLKNet greatly closes the performance gap between CNNs and ViTs, e.g., achieving comparable or superior results than Swin Transformer on ImageNet and a few typical downstream tasks, with lower latency. RepLKNet also shows nice scalability to big data and large models, obtaining 87.8% top-1 accuracy on ImageNet and 56.0% mIoU on ADE20K, which is very competitive among the state-of-the-arts with similar model sizes. Our study further reveals that, in contrast to small-kernel CNNs, large-kernel CNNs have much larger effective receptive fields, and higher shape bias rather than texture bias. Code & models at https://github.com/megvii-research/RepLKNet.  ( 2 min )
    Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning. (arXiv:2203.09064v1 [cs.CV])
    This paper presents new hierarchically cascaded transformers that can improve data efficiency through attribute surrogates learning and spectral tokens pooling. Vision transformers have recently been thought of as a promising alternative to convolutional neural networks for visual recognition. But when there is no sufficient data, it gets stuck in overfitting and shows inferior performance. To improve data efficiency, we propose hierarchically cascaded transformers that exploit intrinsic image structures through spectral tokens pooling and optimize the learnable parameters through latent attribute surrogates. The intrinsic image structure is utilized to reduce the ambiguity between foreground content and background noise by spectral tokens pooling. And the attribute surrogate learning scheme is designed to benefit from the rich visual information in image-label pairs instead of simple visual concepts assigned by their labels. Our Hierarchically Cascaded Transformers, called HCTransformers, is built upon a self-supervised learning framework DINO and is tested on several popular few-shot learning benchmarks. In the inductive setting, HCTransformers surpass the DINO baseline by a large margin of 9.7% 5-way 1-shot accuracy and 9.17% 5-way 5-shot accuracy on miniImageNet, which demonstrates HCTransformers are efficient to extract discriminative features. Also, HCTransformers show clear advantages over SOTA few-shot classification methods in both 5-way 1-shot and 5-way 5-shot settings on four popular benchmark datasets, including miniImageNet, tieredImageNet, FC100, and CIFAR-FS. The trained weights and codes are available at https://github.com/StomachCold/HCTransformers.  ( 2 min )
    One-Shot Adaptation of GAN in Just One CLIP. (arXiv:2203.09301v1 [cs.CV])
    There are many recent research efforts to fine-tune a pre-trained generator with a few target images to generate images of a novel domain. Unfortunately, these methods often suffer from overfitting or under-fitting when fine-tuned with a single target image. To address this, here we present a novel single-shot GAN adaptation method through unified CLIP space manipulations. Specifically, our model employs a two-step training strategy: reference image search in the source generator using a CLIP-guided latent optimization, followed by generator fine-tuning with a novel loss function that imposes CLIP space consistency between the source and adapted generators. To further improve the adapted model to produce spatially consistent samples with respect to the source generator, we also propose contrastive regularization for patchwise relationships in the CLIP space. Experimental results show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively. Furthermore, we show that our CLIP space manipulation strategy allows more effective attribute editing.  ( 2 min )
    Maximum Likelihood Estimation in Gaussian Process Regression is Ill-Posed. (arXiv:2203.09179v1 [math.ST])
    Gaussian process regression underpins countless academic and industrial applications of machine learning and statistics, with maximum likelihood estimation routinely used to select appropriate parameters for the covariance kernel. However, it remains an open problem to establish the circumstances in which maximum likelihood estimation is well-posed. That is, when the predictions of the regression model are continuous (or insensitive to small perturbations) in the training data. This article presents a rigorous proof that the maximum likelihood estimator fails to be well-posed in Hellinger distance in a scenario where the data are noiseless. The failure case occurs for any Gaussian process with a stationary covariance function whose lengthscale parameter is estimated using maximum likelihood. Although the failure of maximum likelihood estimation is informally well-known, these theoretical results appear to be the first of their kind, and suggest that well-posedness may need to be assessed post-hoc, on a case-by-case basis, when maximum likelihood estimation is used to train a Gaussian process model.  ( 2 min )
    Understanding Intrinsic Robustness Using Label Uncertainty. (arXiv:2107.03250v2 [cs.LG] UPDATED)
    A fundamental question in adversarial machine learning is whether a robust classifier exists for a given task. A line of research has made some progress towards this goal by studying the concentration of measure, but we argue standard concentration fails to fully characterize the intrinsic robustness of a classification problem since it ignores data labels which are essential to any classification task. Building on a novel definition of label uncertainty, we empirically demonstrate that error regions induced by state-of-the-art models tend to have much higher label uncertainty than randomly-selected subsets. This observation motivates us to adapt a concentration estimation algorithm to account for label uncertainty, resulting in more accurate intrinsic robustness measures for benchmark image classification problems.  ( 2 min )
    Euler State Networks. (arXiv:2203.09382v1 [cs.LG])
    Inspired by the numerical solution of ordinary differential equations, in this paper we propose a novel Reservoir Computing (RC) model, called the Euler State Network (EuSN). The introduced approach makes use of forward Euler discretization and antisymmetric recurrent matrices to design reservoir dynamics that are both stable and non-dissipative by construction. Our mathematical analysis shows that the resulting model is biased towards unitary effective spectral radius and zero local Lyapunov exponents, intrinsically operating at the edge of stability. Experiments on synthetic tasks indicate the marked superiority of the proposed approach, compared to standard RC models, in tasks requiring long-term memorization skills. Furthermore, results on real-world time series classification benchmarks point out that EuSN is capable of matching (or even surpassing) the level of accuracy of trainable Recurrent Neural Networks, while allowing up to 100-fold savings in computation time and energy consumption.  ( 2 min )
    Stability and Risk Bounds of Iterative Hard Thresholding. (arXiv:2203.09413v1 [stat.ML])
    In this paper, we analyze the generalization performance of the Iterative Hard Thresholding (IHT) algorithm widely used for sparse recovery problems. The parameter estimation and sparsity recovery consistency of IHT has long been known in compressed sensing. From the perspective of statistical learning, another fundamental question is how well the IHT estimation would predict on unseen data. This paper makes progress towards answering this open question by introducing a novel sparse generalization theory for IHT under the notion of algorithmic stability. Our theory reveals that: 1) under natural conditions on the empirical risk function over $n$ samples of dimension $p$, IHT with sparsity level $k$ enjoys an $\mathcal{\tilde O}(n^{-1/2}\sqrt{k\log(n)\log(p)})$ rate of convergence in sparse excess risk; 2) a tighter $\mathcal{\tilde O}(n^{-1/2}\sqrt{\log(n)})$ bound can be established by imposing an additional iteration stability condition on a hypothetical IHT procedure invoked to the population risk; and 3) a fast rate of order $\mathcal{\tilde O}\left(n^{-1}k(\log^3(n)+\log(p))\right)$ can be derived for strongly convex risk function under proper strong-signal conditions. The results have been substantialized to sparse linear regression and sparse logistic regression models to demonstrate the applicability of our theory. Preliminary numerical evidence is provided to confirm our theoretical predictions.  ( 2 min )
    Adaptive n-ary Activation Functions for Probabilistic Boolean Logic. (arXiv:2203.08977v1 [cs.LG])
    Balancing model complexity against the information contained in observed data is the central challenge to learning. In order for complexity-efficient models to exist and be discoverable in high dimensions, we require a computational framework that relates a credible notion of complexity to simple parameter representations. Further, this framework must allow excess complexity to be gradually removed via gradient-based optimization. Our n-ary, or n-argument, activation functions fill this gap by approximating belief functions (probabilistic Boolean logic) using logit representations of probability. Just as Boolean logic determines the truth of a consequent claim from relationships among a set of antecedent propositions, probabilistic formulations generalize predictions when antecedents, truth tables, and consequents all retain uncertainty. Our activation functions demonstrate the ability to learn arbitrary logic, such as the binary exclusive disjunction (p xor q) and ternary conditioned disjunction ( c ? p : q ), in a single layer using an activation function of matching or greater arity. Further, we represent belief tables using a basis that directly associates the number of nonzero parameters to the effective arity of the belief function, thus capturing a concrete relationship between logical complexity and efficient parameter representations. This opens optimization approaches to reduce logical complexity by inducing parameter sparsity.  ( 2 min )
    Contrastive Learning with Positive-Negative Frame Mask for Music Representation. (arXiv:2203.09129v1 [cs.SD])
    Self-supervised learning, especially contrastive learning, has made an outstanding contribution to the development of many deep learning research fields. Recently, researchers in the acoustic signal processing field noticed its success and leveraged contrastive learning for better music representation. Typically, existing approaches maximize the similarity between two distorted audio segments sampled from the same music. In other words, they ensure a semantic agreement at the music level. However, those coarse-grained methods neglect some inessential or noisy elements at the frame level, which may be detrimental to the model to learn the effective representation of music. Towards this end, this paper proposes a novel Positive-nEgative frame mask for Music Representation based on the contrastive learning framework, abbreviated as PEMR. Concretely, PEMR incorporates a Positive-Negative Mask Generation module, which leverages transformer blocks to generate frame masks on the Log-Mel spectrogram. We can generate self-augmented negative and positive samples by masking important components or inessential components, respectively. We devise a novel contrastive learning objective to accommodate both self-augmented positives/negatives sampled from the same music. We conduct experiments on four public datasets. The experimental results of two music-related downstream tasks, music classification, and cover song identification, demonstrate the generalization ability and transferability of music representation learned by PEMR.  ( 2 min )
    Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model. (arXiv:2101.05467v3 [cs.LG] UPDATED)
    The drastic increase of data quantity often brings the severe decrease of data quality, such as incorrect label annotations, which poses a great challenge for robustly training Deep Neural Networks (DNNs). Existing learning \mbox{methods} with label noise either employ ad-hoc heuristics or restrict to specific noise assumptions. However, more general situations, such as instance-dependent label noise, have not been fully explored, as scarce studies focus on their label corruption process. By categorizing instances into confusing and unconfusing instances, this paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances. The resultant model can be realized by DNNs, where the training procedure is accomplished by employing an alternating optimization algorithm. Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness over state-of-the-art counterparts.  ( 2 min )
    Recurrent Neural Network-based Internal Model Control design for stable nonlinear systems. (arXiv:2108.04585v3 [cs.LG] UPDATED)
    Owing to their superior modeling capabilities, gated Recurrent Neural Networks, such as Gated Recurrent Units (GRUs) and Long Short-Term Memory networks (LSTMs), have become popular tools for learning dynamical systems. This paper aims to discuss how these networks can be adopted for the synthesis of Internal Model Control (IMC) architectures. To this end, first a gated recurrent network is used to learn a model of the unknown input-output stable plant. Then, a controller gated recurrent network is trained to approximate the model inverse. The stability of these networks, ensured by means of a suitable training procedure, allows to guarantee the input-output closed-loop stability. The proposed scheme is able to cope with the saturation of the control variables, and can be deployed on low-power embedded controllers, as it requires limited online computations. The approach is then tested on the Quadruple Tank benchmark system and compared to alternative control laws, resulting in remarkable closed-loop performances.  ( 2 min )
    Fine-grained style control in Transformer-based Text-to-speech Synthesis. (arXiv:2110.06306v2 [eess.AS] UPDATED)
    In this paper, we present a novel architecture to realize fine-grained style control on the transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our designed cross-attention blocks for fusion and alignment between content and style. As the fusion is performed along with the skip connection, our cross-attention block provides a good inductive bias to gradually infuse the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating LST during training and using wav2vec 2.0 features. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.  ( 2 min )
    Model-based Multi-agent Policy Optimization with Adaptive Opponent-wise Rollouts. (arXiv:2105.03363v3 [cs.LG] UPDATED)
    This paper investigates the model-based methods in multi-agent reinforcement learning (MARL). We specify the dynamics sample complexity and the opponent sample complexity in MARL, and conduct a theoretic analysis of return discrepancy upper bound. To reduce the upper bound with the intention of low sample complexity during the whole learning process, we propose a novel decentralized model-based MARL method, named Adaptive Opponent-wise Rollout Policy Optimization (AORPO). In AORPO, each agent builds its multi-agent environment model, consisting of a dynamics model and multiple opponent models, and trains its policy with the adaptive opponent-wise rollout. We further prove the theoretic convergence of AORPO under reasonable assumptions. Empirical experiments on competitive and cooperative tasks demonstrate that AORPO can achieve improved sample efficiency with comparable asymptotic performance over the compared MARL methods.  ( 2 min )
    FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes. (arXiv:2110.08059v3 [cs.CV] UPDATED)
    When designing Convolutional Neural Networks (CNNs), one must select the size\break of the convolutional kernels before training. Recent works show CNNs benefit from different kernel sizes at different layers, but exploring all possible combinations is unfeasible in practice. A more efficient approach is to learn the kernel size during training. However, existing works that learn the kernel size have a limited bandwidth. These approaches scale kernels by dilation, and thus the detail they can describe is limited. In this work, we propose FlexConv, a novel convolutional operation with which high bandwidth convolutional kernels of learnable kernel size can be learned at a fixed parameter cost. FlexNets model long-term dependencies without the use of pooling, achieve state-of-the-art performance on several sequential datasets, outperform recent works with learned kernel sizes, and are competitive with much deeper ResNets on image benchmark datasets. Additionally, FlexNets can be deployed at higher resolutions than those seen during training. To avoid aliasing, we propose a novel kernel parameterization with which the frequency of the kernels can be analytically controlled. Our novel kernel parameterization shows higher descriptive power and faster convergence speed than existing parameterizations. This leads to important improvements in classification accuracy.  ( 2 min )
    Noisy Tensor Completion via Low-rank Tensor Ring. (arXiv:2203.08857v1 [stat.ML])
    Tensor completion is a fundamental tool for incomplete data analysis, where the goal is to predict missing entries from partial observations. However, existing methods often make the explicit or implicit assumption that the observed entries are noise-free to provide a theoretical guarantee of exact recovery of missing entries, which is quite restrictive in practice. To remedy such drawbacks, this paper proposes a novel noisy tensor completion model, which complements the incompetence of existing works in handling the degeneration of high-order and noisy observations. Specifically, the tensor ring nuclear norm (TRNN) and least-squares estimator are adopted to regularize the underlying tensor and the observed entries, respectively. In addition, a non-asymptotic upper bound of estimation error is provided to depict the statistical performance of the proposed estimator. Two efficient algorithms are developed to solve the optimization problem with convergence guarantee, one of which is specially tailored to handle large-scale tensors by replacing the minimization of TRNN of the original tensor equivalently with that of a much smaller one in a heterogeneous tensor decomposition framework. Experimental results on both synthetic and real-world data demonstrate the effectiveness and efficiency of the proposed model in recovering noisy incomplete tensor data compared with state-of-the-art tensor completion models.  ( 2 min )
    Deep Generative Models in Engineering Design: A Review. (arXiv:2110.10863v4 [cs.LG] UPDATED)
    Automated design synthesis has the potential to revolutionize the modern engineering design process and improve access to highly optimized and customized products across countless industries. Successfully adapting generative Machine Learning to design engineering may enable such automated design synthesis and is a research subject of great importance. We present a review and analysis of Deep Generative Machine Learning models in engineering design. Deep Generative Models (DGMs) typically leverage deep networks to learn from an input dataset and synthesize new designs. Recently, DGMs such as feedforward Neural Networks (NNs), Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and certain Deep Reinforcement Learning (DRL) frameworks have shown promising results in design applications like structural optimization, materials design, and shape synthesis. The prevalence of DGMs in engineering design has skyrocketed since 2016. Anticipating continued growth, we conduct a review of recent advances to benefit researchers interested in DGMs for design. We structure our review as an exposition of the algorithms, datasets, representation methods, and applications commonly used in the current literature. In particular, we discuss key works that have introduced new techniques and methods in DGMs, successfully applied DGMs to a design-related domain, or directly supported the development of DGMs through datasets or auxiliary methods. We further identify key challenges and limitations currently seen in DGMs across design fields, such as design creativity, handling constraints and objectives, and modeling both form and functional performance simultaneously. In our discussion, we identify possible solution pathways as key areas on which to target future work.  ( 2 min )
    Do We Really Need a Learnable Classifier at the End of Deep Neural Network?. (arXiv:2203.09081v1 [cs.LG])
    Modern deep neural networks for classification usually jointly learn a backbone for representation and a linear classifier to output the logit of each class. A recent study has shown a phenomenon called neural collapse that the within-class means of features and the classifier vectors converge to the vertices of a simplex equiangular tight frame (ETF) at the terminal phase of training on a balanced dataset. Since the ETF geometric structure maximally separates the pair-wise angles of all classes in the classifier, it is natural to raise the question, why do we spend an effort to learn a classifier when we know its optimal geometric structure? In this paper, we study the potential of learning a neural network for classification with the classifier randomly initialized as an ETF and fixed during training. Our analytical work based on the layer-peeled model indicates that the feature learning with a fixed ETF classifier naturally leads to the neural collapse state even when the dataset is imbalanced among classes. We further show that in this case the cross entropy (CE) loss is not necessary and can be replaced by a simple squared loss that shares the same global optimality but enjoys a more accurate gradient and better convergence property. Our experimental results show that our method is able to achieve similar performances on image classification for balanced datasets, and bring significant improvements in the long-tailed and fine-grained classification tasks.  ( 2 min )
    Are Vision Transformers Robust to Spurious Correlations?. (arXiv:2203.09125v1 [cs.CV])
    Deep neural networks may be susceptible to learning spurious correlations that hold on average but not in atypical test samples. As with the recent emergence of vision transformer (ViT) models, it remains underexplored how spurious correlations are manifested in such architectures. In this paper, we systematically investigate the robustness of vision transformers to spurious correlations on three challenging benchmark datasets and compare their performance with popular CNNs. Our study reveals that when pre-trained on a sufficiently large dataset, ViT models are more robust to spurious correlations than CNNs. Key to their success is the ability to generalize better from the examples where spurious correlations do not hold. Further, we perform extensive ablations and experiments to understand the role of the self-attention mechanism in providing robustness under spuriously correlated environments. We hope that our work will inspire future research on further understanding the robustness of ViT models.  ( 2 min )
    Probabilistic Margins for Instance Reweighting in Adversarial Training. (arXiv:2106.07904v2 [cs.LG] UPDATED)
    Reweighting adversarial data during training has been recently shown to improve adversarial robustness, where data closer to the current decision boundaries are regarded as more critical and given larger weights. However, existing methods measuring the closeness are not very reliable: they are discrete and can take only a few values, and they are path-dependent, i.e., they may change given the same start and end points with different attack paths. In this paper, we propose three types of probabilistic margin (PM), which are continuous and path-independent, for measuring the aforementioned closeness and reweighting adversarial data. Specifically, a PM is defined as the difference between two estimated class-posterior probabilities, e.g., such the probability of the true label minus the probability of the most confusing label given some natural data. Though different PMs capture different geometric properties, all three PMs share a negative correlation with the vulnerability of data: data with larger/smaller PMs are safer/riskier and should have smaller/larger weights. Experiments demonstrate that PMs are reliable measurements and PM-based reweighting methods outperform state-of-the-art methods.  ( 2 min )
    Transfer learning for cross-modal demand prediction of bike-share and public transit. (arXiv:2203.09279v1 [cs.LG])
    The urban transportation system is a combination of multiple transport modes, and the interdependencies across those modes exist. This means that the travel demand across different travel modes could be correlated as one mode may receive demand from or create demand for another mode, not to mention natural correlations between different demand time series due to general demand flow patterns across the network. It is expectable that cross-modal ripple effects become more prevalent, with Mobility as a Service. Therefore, by propagating demand data across modes, a better demand prediction could be obtained. To this end, this study explores various machine learning models and transfer learning strategies for cross-modal demand prediction. The trip data of bike-share, metro, and taxi are processed as the station-level passenger flows, and then the proposed prediction method is tested in the large-scale case studies of Nanjing and Chicago. The results suggest that prediction models with transfer learning perform better than unimodal prediction models. Furthermore, stacked Long Short-Term Memory model performs particularly well in cross-modal demand prediction. These results verify our combined method's forecasting improvement over existing benchmarks and demonstrate the good transferability for cross-modal demand prediction in multiple cities.  ( 2 min )
    Improving the Transferability of Targeted Adversarial Examples through Object-Based Diverse Input. (arXiv:2203.09123v1 [cs.CV])
    The transferability of adversarial examples allows the deception on black-box models, and transfer-based targeted attacks have attracted a lot of interest due to their practical applicability. To maximize the transfer success rate, adversarial examples should avoid overfitting to the source model, and image augmentation is one of the primary approaches for this. However, prior works utilize simple image transformations such as resizing, which limits input diversity. To tackle this limitation, we propose the object-based diverse input (ODI) method that draws an adversarial image on a 3D object and induces the rendered image to be classified as the target class. Our motivation comes from the humans' superior perception of an image printed on a 3D object. If the image is clear enough, humans can recognize the image content in a variety of viewing conditions. Likewise, if an adversarial example looks like the target class to the model, the model should also classify the rendered image of the 3D object as the target class. The ODI method effectively diversifies the input by leveraging an ensemble of multiple source objects and randomizing viewing conditions. In our experimental results on the ImageNet-Compatible dataset, this method boosts the average targeted attack success rate from 28.3% to 47.0% compared to the state-of-the-art methods. We also demonstrate the applicability of the ODI method to adversarial examples on the face verification task and its superior performance improvement. Our code is available at https://github.com/dreamflake/ODI.  ( 2 min )
    Risk-Averse No-Regret Learning in Online Convex Games. (arXiv:2203.08957v1 [cs.LG])
    We consider an online stochastic game with risk-averse agents whose goal is to learn optimal decisions that minimize the risk of incurring significantly high costs. Specifically, we use the Conditional Value at Risk (CVaR) as a risk measure that the agents can estimate using bandit feedback in the form of the cost values of only their selected actions. Since the distributions of the cost functions depend on the actions of all agents that are generally unobservable, they are themselves unknown and, therefore, the CVaR values of the costs are difficult to compute. To address this challenge, we propose a new online risk-averse learning algorithm that relies on one-point zeroth-order estimation of the CVaR gradients computed using CVaR values that are estimated by appropriately sampling the cost functions. We show that this algorithm achieves sub-linear regret with high probability. We also propose two variants of this algorithm that improve performance. The first variant relies on a new sampling strategy that uses samples from the previous iteration to improve the estimation accuracy of the CVaR values. The second variant employs residual feedback that uses CVaR values from the previous iteration to reduce the variance of the CVaR gradient estimates. We theoretically analyze the convergence properties of these variants and illustrate their performance on an online market problem that we model as a Cournot game.  ( 2 min )
    On Multi-Domain Long-Tailed Recognition, Generalization and Beyond. (arXiv:2203.09513v1 [cs.LG])
    Real-world data often exhibit imbalanced label distributions. Existing studies on data imbalance focus on single-domain settings, i.e., samples are from the same data distribution. However, natural data can originate from distinct domains, where a minority class in one domain could have abundant instances from other domains. We formalize the task of Multi-Domain Long-Tailed Recognition (MDLT), which learns from multi-domain imbalanced data, addresses label imbalance, domain shift, and divergent label distributions across domains, and generalizes to all domain-class pairs. We first develop the domain-class transferability graph, and show that such transferability governs the success of learning in MDLT. We then propose BoDA, a theoretically grounded learning strategy that tracks the upper bound of transferability statistics, and ensures balanced alignment and calibration across imbalanced domain-class distributions. We curate five MDLT benchmarks based on widely-used multi-domain datasets, and compare BoDA to twenty algorithms that span different learning strategies. Extensive and rigorous experiments verify the superior performance of BoDA. Further, as a byproduct, BoDA establishes new state-of-the-art on Domain Generalization benchmarks, improving generalization to unseen domains. Code and data are available at https://github.com/YyzHarry/multi-domain-imbalance.  ( 2 min )
    A Glimpse of Physical Layer Decision Mechanisms: Facts, Challenges, and Remedies. (arXiv:2102.07258v2 [cs.LG] UPDATED)
    Communications are realized as a result of successive decisions at the physical layer, from modulation selection to multi-antenna strategy, and each decision affects the performance of the communication systems. Future communication systems must include extensive capabilities as they will encompass a wide variety of devices and applications. Conventional physical layer decision mechanisms may not meet these requirements, as they are often based on impractical and oversimplifying assumptions that result in a trade-off between complexity and efficiency. By leveraging past experiences, learning-driven designs are promising solutions to present a resilient decision mechanism and enable rapid response even under exceptional circumstances. The corresponding design solutions should evolve following the lines of learning-driven paradigms that offer more autonomy and robustness. This evolution must take place by considering the facts of real-world systems and without restraining assumptions. In this paper, the common assumptions in the physical layer are presented to highlight their discrepancies with practical systems. As a solution, learning algorithms are examined by considering the implementation steps and challenges. Furthermore, these issues are discussed through a real-time case study using software-defined radio nodes to demonstrate the potential performance improvement. A cyber-physical framework is presented to incorporate future remedies.  ( 2 min )
    Generalized Classification of Satellite Image Time Series with Thermal Positional Encoding. (arXiv:2203.09175v1 [cs.CV])
    Large-scale crop type classification is a task at the core of remote sensing efforts with applications of both economic and ecological importance. Current state-of-the-art deep learning methods are based on self-attention and use satellite image time series (SITS) to discriminate crop types based on their unique growth patterns. However, existing methods generalize poorly to regions not seen during training mainly due to not being robust to temporal shifts of the growing season caused by variations in climate. To this end, we propose Thermal Positional Encoding (TPE) for attention-based crop classifiers. Unlike previous positional encoding based on calendar time (e.g. day-of-year), TPE is based on thermal time, which is obtained by accumulating daily average temperatures over the growing season. Since crop growth is directly related to thermal time, but not calendar time, TPE addresses the temporal shifts between different regions to improve generalization. We propose multiple TPE strategies, including learnable methods, to further improve results compared to the common fixed positional encodings. We demonstrate our approach on a crop classification task across four different European regions, where we obtain state-of-the-art generalization results.  ( 2 min )
    Latent Structure Mining with Contrastive Modality Fusion for Multimedia Recommendation. (arXiv:2111.00678v2 [cs.IR] UPDATED)
    Recent years have witnessed growing interests in multimedia recommendation, which aims to predict whether a user will interact with an item with multimodal contents. Previous studies focus on modeling user-item interactions with multimodal features included as side information. However, this scheme is not well-designed for multimedia recommendation. Firstly, only collaborative item-item relationships are implicitly modeled through high-order item-user-item co-occurrences. We argue that the latent semantic item-item structures underlying these multimodal contents could be beneficial for learning better item representations and assist the recommender models to comprehensively discover candidate items. Secondly, previous studies disregard the fine-grained multimodal fusion. Although having access to multiple modalities might allow us to capture rich information, we argue that the simple coarse-grained fusion by linear combination or concatenation in previous work is insufficient to fully understand content information and item relationships.To this end, we propose a latent structure MIning with ContRastive mOdality fusion method (MICRO for brevity). To be specific, we devise a novel modality-aware structure learning module, which learns item-item relationships for each modality. Based on the learned modality-aware latent item relationships, we perform graph convolutions that explicitly inject item affinities to modality-aware item representations. Then, we design a novel contrastive method to fuse multimodal features. These enriched item representations can be plugged into existing collaborative filtering methods to make more accurate recommendations. Extensive experiments on real-world datasets demonstrate the superiority of our method over state-of-the-art baselines.  ( 2 min )
    DeepAD: A Robust Deep Learning Model of Alzheimer's Disease Progression for Real-World Clinical Applications. (arXiv:2203.09096v1 [cs.LG])
    The ability to predict the future trajectory of a patient is a key step toward the development of therapeutics for complex diseases such as Alzheimer's disease (AD). However, most machine learning approaches developed for prediction of disease progression are either single-task or single-modality models, which can not be directly adopted to our setting involving multi-task learning with high dimensional images. Moreover, most of those approaches are trained on a single dataset (i.e. cohort), which can not be generalized to other cohorts. We propose a novel multimodal multi-task deep learning model to predict AD progression by analyzing longitudinal clinical and neuroimaging data from multiple cohorts. Our proposed model integrates high dimensional MRI features from a 3D convolutional neural network with other data modalities, including clinical and demographic information, to predict the future trajectory of patients. Our model employs an adversarial loss to alleviate the study-specific imaging bias, in particular the inter-study domain shifts. In addition, a Sharpness-Aware Minimization (SAM) optimization technique is applied to further improve model generalization. The proposed model is trained and tested on various datasets in order to evaluate and validate the results. Our results showed that 1) our model yields significant improvement over the baseline models, and 2) models using extracted neuroimaging features from 3D convolutional neural network outperform the same models when applied to MRI-derived volumetric features.  ( 2 min )
    HybridNets: End-to-End Perception Network. (arXiv:2203.09035v1 [cs.CV])
    End-to-end Network has become increasingly important in multi-tasking. One prominent example of this is the growing significance of a driving perception system in autonomous driving. This paper systematically studies an end-to-end perception network for multi-tasking and proposes several key optimizations to improve accuracy. First, the paper proposes efficient segmentation head and box/class prediction networks based on weighted bidirectional feature network. Second, the paper proposes automatically customized anchor for each level in the weighted bidirectional feature network. Third, the paper proposes an efficient training loss function and training strategy to balance and optimize network. Based on these optimizations, we have developed an end-to-end perception network to perform multi-tasking, including traffic object detection, drivable area segmentation and lane detection simultaneously, called HybridNets, which achieves better accuracy than prior art. In particular, HybridNets achieves 77.3 mean Average Precision on Berkeley DeepDrive Dataset, outperforms lane detection with 31.6 mean Intersection Over Union with 12.83 million parameters and 15.6 billion floating-point operations. In addition, it can perform visual perception tasks in real-time and thus is a practical and accurate solution to the multi-tasking problem. Code is available at https://github.com/datvuthanh/HybridNets.  ( 2 min )
    Confidence Dimension for Deep Learning based on Hoeffding Inequality and Relative Evaluation. (arXiv:2203.09082v1 [cs.LG])
    Research on the generalization ability of deep neural networks (DNNs) has recently attracted a great deal of attention. However, due to their complex architectures and large numbers of parameters, measuring the generalization ability of specific DNN models remains an open challenge. In this paper, we propose to use multiple factors to measure and rank the relative generalization of DNNs based on a new concept of confidence dimension (CD). Furthermore, we provide a feasible framework in our CD to theoretically calculate the upper bound of generalization based on the conventional Vapnik-Chervonenk dimension (VC-dimension) and Hoeffding's inequality. Experimental results on image classification and object detection demonstrate that our CD can reflect the relative generalization ability for different DNNs. In addition to full-precision DNNs, we also analyze the generalization ability of binary neural networks (BNNs), whose generalization ability remains an unsolved problem. Our CD yields a consistent and reliable measure and ranking for both full-precision DNNs and BNNs on all the tasks.  ( 2 min )
    SeMask: Semantically Masked Transformers for Semantic Segmentation. (arXiv:2112.12782v2 [cs.CV] UPDATED)
    Finetuning a pretrained backbone in the encoder part of an image transformer network has been the traditional approach for the semantic segmentation task. However, such an approach leaves out the semantic context that an image provides during the encoding stage. This paper argues that incorporating semantic information of the image into pretrained hierarchical transformer-based backbones while finetuning improves the performance considerably. To achieve this, we propose SeMask, a simple and effective framework that incorporates semantic information into the encoder with the help of a semantic attention operation. In addition, we use a lightweight semantic decoder during training to provide supervision to the intermediate semantic prior maps at every stage. Our experiments demonstrate that incorporating semantic priors enhances the performance of the established hierarchical encoders with a slight increase in the number of FLOPs. We provide empirical proof by integrating SeMask into Swin Transformer and Mix Transformer backbones as our encoder paired with different decoders. Our framework achieves a new state-of-the-art of 58.22% mIoU on the ADE20K dataset and improvements of over 3% in the mIoU metric on the Cityscapes dataset. The code and checkpoints are publicly available at https://github.com/Picsart-AI-Research/SeMask-Segmentation .  ( 2 min )
    A Survey of Non-Rigid 3D Registration. (arXiv:2203.07858v2 [cs.CV] UPDATED)
    Non-rigid registration computes an alignment between a source surface with a target surface in a non-rigid manner. In the past decade, with the advances in 3D sensing technologies that can measure time-varying surfaces, non-rigid registration has been applied for the acquisition of deformable shapes and has a wide range of applications. This survey presents a comprehensive review of non-rigid registration methods for 3D shapes, focusing on techniques related to dynamic shape acquisition and reconstruction. In particular, we review different approaches for representing the deformation field, and the methods for computing the desired deformation. Both optimization-based and learning-based methods are covered. We also review benchmarks and datasets for evaluating non-rigid registration methods, and discuss potential future research directions.  ( 2 min )
    SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. (arXiv:2108.10904v2 [cs.CV] UPDATED)
    With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.  ( 2 min )
    Provable Adversarial Robustness for Fractional Lp Threat Models. (arXiv:2203.08945v1 [cs.LG])
    In recent years, researchers have extensively studied adversarial robustness in a variety of threat models, including L_0, L_1, L_2, and L_infinity-norm bounded adversarial attacks. However, attacks bounded by fractional L_p "norms" (quasi-norms defined by the L_p distance with 0<p<1) have yet to be thoroughly considered. We proactively propose a defense with several desirable properties: it provides provable (certified) robustness, scales to ImageNet, and yields deterministic (rather than high-probability) certified guarantees when applied to quantized data (e.g., images). Our technique for fractional L_p robustness constructs expressive, deep classifiers that are globally Lipschitz with respect to the L_p^p metric, for any 0<p<1. However, our method is even more general: we can construct classifiers which are globally Lipschitz with respect to any metric defined as the sum of concave functions of components. Our approach builds on a recent work, Levine and Feizi (2021), which provides a provable defense against L_1 attacks. However, we demonstrate that our proposed guarantees are highly non-vacuous, compared to the trivial solution of using (Levine and Feizi, 2021) directly and applying norm inequalities. Code is available at https://github.com/alevine0/fractionalLpRobustness.  ( 2 min )
    Distributed-Memory Sparse Kernels for Machine Learning. (arXiv:2203.07673v1 [cs.DC] CROSS LISTED)
    Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph embedding. Frequently, the SDDMM output becomes the input sparse matrix for a subsequent SpMM operation. Existing work has focused on shared memory parallelization of these primitives. While there has been extensive analysis of communication-minimizing distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or the back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that distributed memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM with identical communication costs and input / output data layouts. Further, we give two communication-eliding strategies to reduce costs further for FusedMM kernels: either reusing the replication of an input dense matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM kernels. We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using Erdos-Renyi random matrices and large real-world sparse matrices. On 256 nodes with 68 cores each, 1.5D FusedMM algorithms using either communication eliding approach can save at least 30% of time spent exclusively in communication compared to executing a distributed-memory SpMM and SDDMM kernel in sequence. On real-world matrices with hundreds of millions of edges, all of our algorithms exhibit at least a 10x speedup over the SpMM algorithm in PETSc. On these matrices, our communication-eliding techniques exhibit runtimes up to 1.6 times faster than an unoptimized sequence of SDDMM and SpMM. We embed and test the scaling of our algorithms in real-world applications, including collaborative filtering via alternating-least-squares and inference for attention-based graph neural networks.  ( 2 min )
    On Constraints in First-Order Optimization: A View from Non-Smooth Dynamical Systems. (arXiv:2107.08225v2 [math.OC] UPDATED)
    We introduce a class of first-order methods for smooth constrained optimization that are based on an analogy to non-smooth dynamical systems. Two distinctive features of our approach are that (i) projections or optimizations over the entire feasible set are avoided, in stark contrast to projected gradient methods or the Frank-Wolfe method, and (ii) iterates are allowed to become infeasible, which differs from active set or feasible direction methods, where the descent motion stops as soon as a new constraint is encountered. The resulting algorithmic procedure is simple to implement even when constraints are nonlinear, and is suitable for large-scale constrained optimization problems in which the feasible set fails to have a simple structure. The key underlying idea is that constraints are expressed in terms of velocities instead of positions, which has the algorithmic consequence that optimizations over feasible sets at each iteration are replaced with optimizations over local, sparse convex approximations. In particular, this means that at each iteration only constraints that are violated are taken into account. The result is a simplified suite of algorithms and an expanded range of possible applications in machine learning.  ( 2 min )
    Intracranial Hemorrhage Detection Using Neural Network Based Methods With Federated Learning. (arXiv:2005.08644v3 [cs.CV] UPDATED)
    Intracranial hemorrhage, bleeding that occurs inside the cranium, is a serious health problem requiring rapid and often intensive medical treatment. Such a condition is traditionally diagnosed by highly-trained specialists analyzing computed tomography (CT) scan of the patient and identifying the location and type of hemorrhage if one exists. We propose a neural network approach to find and classify the condition based upon the CT scan. The model architecture implements a time distributed convolutional network. We observed accuracy above 92% from such an architecture, provided enough data. We propose further extensions to our approach involving the deployment of federated learning. This would be helpful in pooling learned parameters without violating the inherent privacy of the data involved.  ( 2 min )
    CYBORGS: Contrastively Bootstrapping Object Representations by Grounding in Segmentation. (arXiv:2203.09343v1 [cs.CV])
    Many recent approaches in contrastive learning have worked to close the gap between pretraining on iconic images like ImageNet and pretraining on complex scenes like COCO. This gap exists largely because commonly used random crop augmentations obtain semantically inconsistent content in crowded scene images of diverse objects. Previous works use preprocessing pipelines to localize salient objects for improved cropping, but an end-to-end solution is still elusive. In this work, we propose a framework which accomplishes this goal via joint learning of representations and segmentation. We leverage segmentation masks to train a model with a mask-dependent contrastive loss, and use the partially trained model to bootstrap better masks. By iterating between these two components, we ground the contrastive updates in segmentation information, and simultaneously improve segmentation throughout pretraining. Experiments show our representations transfer robustly to downstream tasks in classification, detection and segmentation.  ( 2 min )
    Example Perplexity. (arXiv:2203.08813v1 [cs.LG])
    Some examples are easier for humans to classify than others. The same should be true for deep neural networks (DNNs). We use the term example perplexity to refer to the level of difficulty of classifying an example. In this paper, we propose a method to measure the perplexity of an example and investigate what factors contribute to high example perplexity. The related codes and resources are available at https://github.com/vaynexie/Example-Perplexity.  ( 2 min )
    BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. (arXiv:2106.10199v3 [cs.LG] UPDATED)
    We introduce BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset of them) are being modified. We show that with small-to-medium training data, applying BitFit on pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model. For larger data, the method is competitive with other sparse fine-tuning methods. Besides their practical utility, these findings are relevant for the question of understanding the commonly-used process of finetuning: they support the hypothesis that finetuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge.  ( 2 min )
    AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation. (arXiv:2203.09516v1 [cs.CV])
    Powerful priors allow us to perform inference with insufficient information. In this paper, we propose an autoregressive prior for 3D shapes to solve multimodal 3D tasks such as shape completion, reconstruction, and generation. We model the distribution over 3D shapes as a non-sequential autoregressive distribution over a discretized, low-dimensional, symbolic grid-like latent representation of 3D shapes. This enables us to represent distributions over 3D shapes conditioned on information from an arbitrary set of spatially anchored query locations and thus perform shape completion in such arbitrary settings (e.g., generating a complete chair given only a view of the back leg). We also show that the learned autoregressive prior can be leveraged for conditional tasks such as single-view reconstruction and language-based generation. This is achieved by learning task-specific naive conditionals which can be approximated by light-weight models trained on minimal paired data. We validate the effectiveness of the proposed method using both quantitative and qualitative evaluation and show that the proposed method outperforms the specialized state-of-the-art methods trained for individual tasks. The project page with code and video visualizations can be found at https://yccyenchicheng.github.io/AutoSDF/.  ( 2 min )
    An Explainable Stacked Ensemble Model for Static Route-Free Estimation of Time of Arrival. (arXiv:2203.09438v1 [cs.LG])
    To compare alternative taxi schedules and to compute them, as well as to provide insights into an upcoming taxi trip to drivers and passengers, the duration of a trip or its Estimated Time of Arrival (ETA) is predicted. To reach a high prediction precision, machine learning models for ETA are state of the art. One yet unexploited option to further increase prediction precision is to combine multiple ETA models into an ensemble. While an increase of prediction precision is likely, the main drawback is that the predictions made by such an ensemble become less transparent due to the sophisticated ensemble architecture. One option to remedy this drawback is to apply eXplainable Artificial Intelligence (XAI). The contribution of this paper is three-fold. First, we combine multiple machine learning models from our previous work for ETA into a two-level ensemble model - a stacked ensemble model - which on its own is novel; therefore, we can outperform previous state-of-the-art static route-free ETA approaches. Second, we apply existing XAI methods to explain the first- and second-level models of the ensemble. Third, we propose three joining methods for combining the first-level explanations with the second-level ones. Those joining methods enable us to explain stacked ensembles for regression tasks. An experimental evaluation shows that the ETA models correctly learned the importance of those input features driving the prediction.  ( 2 min )
    ECONet: Efficient Convolutional Online Likelihood Network for Scribble-based Interactive Segmentation. (arXiv:2201.04584v3 [eess.IV] UPDATED)
    Automatic segmentation of lung lesions associated with COVID-19 in CT images requires large amount of annotated volumes. Annotations mandate expert knowledge and are time-intensive to obtain through fully manual segmentation methods. Additionally, lung lesions have large inter-patient variations, with some pathologies having similar visual appearance as healthy lung tissues. This poses a challenge when applying existing semi-automatic interactive segmentation techniques for data labelling. To address these challenges, we propose an efficient convolutional neural networks (CNNs) that can be learned online while the annotator provides scribble-based interaction. To accelerate learning from only the samples labelled through user-interactions, a patch-based approach is used for training the network. Moreover, we use weighted cross-entropy loss to address the class imbalance that may result from user-interactions. During online inference, the learned network is applied to the whole input volume using a fully convolutional approach. We compare our proposed method with state-of-the-art using synthetic scribbles and show that it outperforms existing methods on the task of annotating lung lesions associated with COVID-19, achieving 16% higher Dice score while reducing execution time by 3$\times$ and requiring 9000 lesser scribbles-based labelled voxels. Due to the online learning aspect, our approach adapts quickly to user input, resulting in high quality segmentation labels. Source code for ECONet is available at: https://github.com/masadcv/ECONet-MONAILabel.  ( 2 min )
    How Many Data Samples is an Additional Instruction Worth?. (arXiv:2203.09161v1 [cs.CL])
    Recently introduced instruction-paradigm empowers non-expert users to leverage NLP resources by defining a new task in natural language. Instruction-tuned models have significantly outperformed multitask learning models (without instruction); however they are far from state of the art task specific models. Conventional approaches to improve model performance via creating large datasets with lots of task instances or architectural/training changes in model may not be feasible for non-expert users. However, they can write alternate instructions to represent an instruction task. Is Instruction-augumentation helpful? We augment a subset of tasks in NATURAL INSTRUCTIONS with additional instructions and find that these significantly improve model performance (upto 35%) specially in low-data regime. Our results indicate that an additional instruction can be equivalent to ~40 instances on average across our evaluation tasks.  ( 2 min )
    Deep Reinforcement Learning-Based Long-Range Autonomous Valet Parking for Smart Cities. (arXiv:2109.11661v2 [cs.LG] UPDATED)
    In this paper, to reduce the congestion rate at the city center and increase the quality of experience (QoE) of each user, the framework of long-range autonomous valet parking (LAVP) is presented, where an Autonomous Vehicle (AV) is deployed in the city, which can pick up, drop off users at their required spots, and then drive to the car park out of city center autonomously. In this framework, we aim to minimize the overall distance of the AV, while guarantee all users are served, i.e., picking up, and dropping off users at their required spots through optimizing the path planning of the AV and number of serving time slots. To this end, we first propose a learning based algorithm, which is named as Double-Layer Ant Colony Optimization (DL-ACO) algorithm to solve the above problem in an iterative way. Then, to make the real-time decision, while consider the dynamic environment (i.e., the AV may pick up and drop off users from different locations), we further present a deep reinforcement learning (DRL) based algorithm, which is known as deep Q network (DQN). The experimental results show that the DL-ACO and DQN-based algorithms both achieve the considerable performance.  ( 2 min )
    Artificial Intelligence in the Battle against Coronavirus (COVID-19): A Survey and Future Research Directions. (arXiv:2008.07343v4 [cs.CY] UPDATED)
    Artificial intelligence (AI) has been applied widely in our daily lives in a variety of ways with numerous success stories. AI has also contributed to dealing with the coronavirus disease (COVID-19) pandemic, which has been happening around the globe. This paper presents a survey of AI methods being used in various applications in the fight against the COVID-19 outbreak and outlines the crucial role of AI research in this unprecedented battle. We touch on areas where AI plays as an essential component, from medical image processing, data analytics, text mining and natural language processing, the Internet of Things, to computational biology and medicine. A summary of COVID-19 related data sources that are available for research purposes is also presented. Research directions on exploring the potential of AI and enhancing its capability and power in the pandemic battle are thoroughly discussed. We identify 13 groups of problems related to the COVID-19 pandemic and highlight promising AI methods and tools that can be used to address these problems. It is envisaged that this study will provide AI researchers and the wider community with an overview of the current status of AI applications, and motivate researchers to harness AI's potential in the fight against COVID-19.  ( 3 min )
    Phased Flight Trajectory Prediction with Deep Learning. (arXiv:2203.09033v1 [cs.LG])
    The unprecedented increase of commercial airlines and private jets over the next ten years presents a challenge for air traffic control. Precise flight trajectory prediction is of great significance in air transportation management, which contributes to the decision-making for safe and orderly flights. Existing research and application mainly focus on the sequence generation based on historical trajectories, while the aircraft-aircraft interactions in crowded airspace especially the airspaces near busy airports have been largely ignored. On the other hand, there are distinct characteristics of aerodynamics for different flight phases, and the trajectory may be affected by various uncertainties such as weather and advisories from air traffic controllers. However, there is no literature fully considers all these issues. Therefore, we proposed a phased flight trajectory prediction framework. Multi-source and multi-modal datasets have been analyzed and mined using variants of recurrent neural network (RNN) mixture. To be specific, we first introduce spatio temporal graphs into the low-altitude airway prediction problem, and the motion constraints of an aircraft are embedded to the inference process for reliable forecasting results. In the en-route phase, the dual attention mechanism is employed to adaptively extract much more important features from overall datasets to learn the hidden patterns in dynamical environments. The experimental results demonstrate our proposed framework can outperform state-of-the-art methods for flight trajectory prediction for large passenger/transport airplanes.  ( 2 min )
    Discovering the building blocks of dark matter halo density profiles with neural networks. (arXiv:2203.08827v1 [astro-ph.CO])
    The density profiles of dark matter halos are typically modeled using empirical formulae fitted to the density profiles of relaxed halo populations. We present a neural network model that is trained to learn the mapping from the raw density field containing each halo to the dark matter density profile. We show that the model recovers the widely-used Navarro-Frenk-White (NFW) profile out to the virial radius, and can additionally describe the variability in the outer profile of the halos. The neural network architecture consists of a supervised encoder-decoder framework, which first compresses the density inputs into a low-dimensional latent representation, and then outputs $\rho(r)$ for any desired value of radius $r$. The latent representation contains all the information used by the model to predict the density profiles. This allows us to interpret the latent representation by quantifying the mutual information between the representation and the halos' ground-truth density profiles. A two-dimensional representation is sufficient to accurately model the density profiles up to the virial radius; however, a three-dimensional representation is required to describe the outer profiles beyond the virial radius. The additional dimension in the representation contains information about the infalling material in the outer profiles of dark matter halos, thus discovering the splashback boundary of halos without prior knowledge of the halos' dynamical history.  ( 2 min )
    Hierarchical Clustering and Matrix Completion for the Reconstruction of World Input-Output Tables. (arXiv:2203.08819v1 [stat.ML])
    World Input-Output (I/O) matrices provide the networks of within- and cross-country economic relations. In the context of I/O analysis, the methodology adopted by national statistical offices in data collection raises the issue of obtaining reliable data in a timely fashion and it makes the reconstruction of (part of) the I/O matrices of particular interest. In this work, we propose a method combining hierarchical clustering and Matrix Completion (MC) with a LASSO-like nuclear norm penalty, to impute missing entries of a partially unknown I/O matrix. Through simulations based on synthetic matrices we study the effectiveness of the proposed method to predict missing values from both previous years data and current data related to countries similar to the one for which current data are obscured. To show the usefulness of our method, an application based on World Input-Output Database (WIOD) tables - which are an example of industry-by-industry I/O tables - is provided. Strong similarities in structure between WIOD and other I/O tables are also found, which make the proposed approach easily generalizable to them.  ( 2 min )
    A Stochastic Halpern Iteration with Variance Reduction for Stochastic Monotone Inclusion Problems. (arXiv:2203.09436v1 [math.OC])
    We study stochastic monotone inclusion problems, which widely appear in machine learning applications, including robust regression and adversarial learning. We propose novel variants of stochastic Halpern iteration with recursive variance reduction. In the cocoercive -- and more generally Lipschitz-monotone -- setup, our algorithm attains $\epsilon$ norm of the operator with $\mathcal{O}(\frac{1}{\epsilon^3})$ stochastic operator evaluations, which significantly improves over state of the art $\mathcal{O}(\frac{1}{\epsilon^4})$ stochastic operator evaluations required for existing monotone inclusion solvers applied to the same problem classes. We further show how to couple one of the proposed variants of stochastic Halpern iteration with a scheduled restart scheme to solve stochastic monotone inclusion problems with ${\mathcal{O}}(\frac{\log(1/\epsilon)}{\epsilon^2})$ stochastic operator evaluations under additional sharpness or strong monotonicity assumptions. Finally, we argue via reductions between different problem classes that our stochastic oracle complexity bounds are tight up to logarithmic factors in terms of their $\epsilon$-dependence.  ( 2 min )
    Latent-Variable Advantage-Weighted Policy Optimization for Offline RL. (arXiv:2203.08949v1 [cs.LG])
    Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions. This setting is particularly well-suited for continuous control robotic applications for which online data collection based on trial-and-error is costly and potentially unsafe. In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios, such as data from several human demonstrators or from policies that act with different purposes. Unfortunately, such datasets can exacerbate the distribution shift between the behavior policy underlying the data and the optimal policy to be learned, leading to poor performance. To address this challenge, we propose to leverage latent-variable policies that can represent a broader class of policy distributions, leading to better adherence to the training data distribution while maximizing reward via a policy over the latent variable. As we empirically show on a range of simulated locomotion, navigation, and manipulation tasks, our method referred to as latent-variable advantage-weighted policy optimization (LAPO), improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets, and by 8% on datasets with narrow and biased distributions.  ( 2 min )
    Semi-FedSER: Semi-supervised Learning for Speech Emotion Recognition On Federated Learning using Multiview Pseudo-Labeling. (arXiv:2203.08810v1 [eess.AS])
    Speech Emotion Recognition (SER) application is frequently associated with privacy concerns as it often acquires and transmits speech data at the client-side to remote cloud platforms for further processing. These speech data can reveal not only speech content and affective information but the speaker's identity, demographic traits, and health status. Federated learning (FL) is a distributed machine learning algorithm that coordinates clients to train a model collaboratively without sharing local data. This algorithm shows enormous potential for SER applications as sharing raw speech or speech features from a user's device is vulnerable to privacy attacks. However, a major challenge in FL is limited availability of high-quality labeled data samples. In this work, we propose a semi-supervised federated learning framework, Semi-FedSER, that utilizes both labeled and unlabeled data samples to address the challenge of limited labeled data samples in FL. We show that our Semi-FedSER can generate desired SER performance even when the local label rate l=20 using two SER benchmark datasets: IEMOCAP and MSP-Improv.  ( 2 min )
    Self-Supervised Deep Learning to Enhance Breast Cancer Detection on Screening Mammography. (arXiv:2203.08812v1 [eess.IV])
    A major limitation in applying deep learning to artificial intelligence (AI) systems is the scarcity of high-quality curated datasets. We investigate strong augmentation based self-supervised learning (SSL) techniques to address this problem. Using breast cancer detection as an example, we first identify a mammogram-specific transformation paradigm and then systematically compare four recent SSL methods representing a diversity of approaches. We develop a method to convert a pretrained model from making predictions on uniformly tiled patches to whole images, and an attention-based pooling method that improves the classification performance. We found that the best SSL model substantially outperformed the baseline supervised model. The best SSL model also improved the data efficiency of sample labeling by nearly 4-fold and was highly transferrable from one dataset to another. SSL represents a major breakthrough in computer vision and may help the AI for medical imaging field to shift away from supervised learning and dependency on scarce labels.  ( 2 min )
    Neural network processing of holographic images. (arXiv:2203.08898v1 [eess.IV])
    HOLODEC, an airborne cloud particle imager, captures holographic images of a fixed volume of cloud to characterize the types and sizes of cloud particles, such as water droplets and ice crystals. Cloud particle properties include position, diameter, and shape. We present a hologram processing algorithm, HolodecML, that utilizes a neural segmentation model, GPUs, and computational parallelization. HolodecML is trained using synthetically generated holograms based on a model of the instrument, and predicts masks around particles found within reconstructed images. From these masks, the position and size of the detected particles can be characterized in three dimensions. In order to successfully process real holograms, we find we must apply a series of image corrupting transformations and noise to the synthetic images used in training. In this evaluation, HolodecML had comparable position and size estimation performance to the standard processing method, but improved particle detection by nearly 20\% on several thousand manually labeled HOLODEC images. However, the improvement only occurred when image corruption was performed on the simulated images during training, thereby mimicking non-ideal conditions in the actual probe. The trained model also learned to differentiate artifacts and other impurities in the HOLODEC images from the particles, even though no such objects were present in the training data set, while the standard processing method struggled to separate particles from artifacts. The novelty of the training approach, which leveraged noise as a means for parameterizing non-ideal aspects of the HOLODEC detector, could be applied in other domains where the theoretical model is incapable of fully describing the real-world operation of the instrument and accurate truth data required for supervised learning cannot be obtained from real-world observations.  ( 2 min )
    QUBOs for Sorting Lists and Building Trees. (arXiv:2203.08815v1 [cs.DS])
    We show that the fundamental tasks of sorting lists and building search trees or heaps can be modeled as quadratic unconstrained binary optimization problems (QUBOs). The idea is to understand these tasks as permutation problems and to devise QUBOs whose solutions represent appropriate permutation matrices. We discuss how to construct such QUBOs and how to solve them using Hopfield nets or adiabatic) quantum computing. In short, we show that neurocomputing methods or quantum computers can solve problems usually associated with abstract data structures.  ( 2 min )
    A Survey of Multi-Agent Reinforcement Learning with Communication. (arXiv:2203.08975v1 [cs.MA])
    Communication is an effective mechanism for coordinating the behavior of multiple agents. In the field of multi-agent reinforcement learning, agents can improve the overall learning performance and achieve their objectives by communication. Moreover, agents can communicate various types of messages, either to all agents or to specific agent groups, and through specific channels. With the growing body of research work in MARL with communication (Comm-MARL), there is lack of a systematic and structural approach to distinguish and classify existing Comm-MARL systems. In this paper, we survey recent works in the Comm-MARL field and consider various aspects of communication that can play a role in the design and development of multi-agent reinforcement learning systems. With these aspects in mind, we propose several dimensions along which Comm-MARL systems can be analyzed, developed, and compared.  ( 2 min )
    Time Dependency, Data Flow, and Competitive Advantage. (arXiv:2203.09128v1 [cs.LG])
    Data is fundamental to machine learning-based products and services and is considered strategic due to its externalities for businesses, governments, non-profits, and more generally for society. It is renowned that the value of organizations (businesses, government agencies and programs, and even industries) scales with the volume of available data. What is often less appreciated is that the data value in making useful organizational predictions will range widely and is prominently a function of data characteristics and underlying algorithms. In this research, our goal is to study how the value of data changes over time and how this change varies across contexts and business areas (e.g. next word prediction in the context of history, sports, politics). We focus on data from Reddit.com and compare the value's time-dependency across various Reddit topics (Subreddits). We make this comparison by measuring the rate at which user-generated text data loses its relevance to the algorithmic prediction of conversations. We show that different subreddits have different rates of relevance decline over time. Relating the text topics to various business areas of interest, we argue that competing in a business area in which data value decays rapidly alters strategies to acquire competitive advantage. When data value decays rapidly, access to a continuous flow of data will be more valuable than access to a fixed stock of data. In this kind of setting, improving user engagement and increasing user-base help creating and maintaining a competitive advantage.  ( 2 min )
    MotionAug: Augmentation with Physical Correction for Human Motion Prediction. (arXiv:2203.09116v1 [cs.CV])
    This paper presents a motion data augmentation scheme incorporating motion synthesis encouraging diversity and motion correction imposing physical plausibility. This motion synthesis consists of our modified Variational AutoEncoder (VAE) and Inverse Kinematics (IK). In this VAE, our proposed sampling-near-samples method generates various valid motions even with insufficient training motion data. Our IK-based motion synthesis method allows us to generate a variety of motions semi-automatically. Since these two schemes generate unrealistic artifacts in the synthesized motions, our motion correction rectifies them. This motion correction scheme consists of imitation learning with physics simulation and subsequent motion debiasing. For this imitation learning, we propose the PD-residual force that significantly accelerates the training process. Furthermore, our motion debiasing successfully offsets the motion bias induced by imitation learning to maximize the effect of augmentation. As a result, our method outperforms previous noise-based motion augmentation methods by a large margin on both Recurrent Neural Network-based and Graph Convolutional Network-based human motion prediction models. The code is available at {\rm \url{https://github.com/meaten/MotionAug}}.  ( 2 min )
    DePS: An improved deep learning model for de novo peptide sequencing. (arXiv:2203.08820v1 [q-bio.QM])
    De novo peptide sequencing from mass spectrometry data is an important method for protein identification. Recently, various deep learning approaches were applied for de novo peptide sequencing and DeepNovoV2 is one of the represetative models. In this study, we proposed an enhanced model, DePS, which can improve the accuracy of de novo peptide sequencing even with missing signal peaks or large number of noisy peaks in tandem mass spectrometry data. It is showed that, for the same test set of DeepNovoV2, the DePS model achieved excellent results of 74.22%, 74.21% and 41.68% for amino acid recall, amino acid precision and peptide recall respectively. Furthermore, the results suggested that DePS outperforms DeepNovoV2 on the cross species dataset.  ( 2 min )
    Physics-Informed Neural Networks with Adaptive Localized Artificial Viscosity. (arXiv:2203.08802v1 [physics.flu-dyn])
    Physics-informed Neural Network (PINN) is a promising tool that has been applied in a variety of physical phenomena described by partial differential equations (PDE). However, it has been observed that PINNs are difficult to train in certain "stiff" problems, which include various nonlinear hyperbolic PDEs that display shocks in their solutions. Recent studies added a diffusion term to the PDE, and an artificial viscosity (AV) value was manually tuned to allow PINNs to solve these problems. In this paper, we propose three approaches to address this problem, none of which rely on an a priori definition of the artificial viscosity value. The first method learns a global AV value, whereas the other two learn localized AV values around the shocks, by means of a parametrized AV map or a residual-based AV map. We applied the proposed methods to the inviscid Burgers equation and the Buckley-Leverett equation, the latter being a classical problem in Petroleum Engineering. The results show that the proposed methods are able to learn both a small AV value and the accurate shock location and improve the approximation error over a nonadaptive global AV alternative method.  ( 2 min )
    Understanding robustness and generalization of artificial neural networks through Fourier masks. (arXiv:2203.08822v1 [cs.CV])
    Despite the enormous success of artificial neural networks (ANNs) in many disciplines, the characterization of their computations and the origin of key properties such as generalization and robustness remain open questions. Recent literature suggests that robust networks with good generalization properties tend to be biased towards processing low frequencies in images. To explore the frequency bias hypothesis further, we develop an algorithm that allows us to learn modulatory masks highlighting the essential input frequencies needed for preserving a trained network's performance. We achieve this by imposing invariance in the loss with respect to such modulations in the input frequencies. We first use our method to test the low-frequency preference hypothesis of adversarially trained or data-augmented networks. Our results suggest that adversarially robust networks indeed exhibit a low-frequency bias but we find this bias is also dependent on directions in frequency space. However, this is not necessarily true for other types of data augmentation. Our results also indicate that the essential frequencies in question are effectively the ones used to achieve generalization in the first place. Surprisingly, images seen through these modulatory masks are not recognizable and resemble texture-like patterns.  ( 2 min )
    Neural-Network-Directed Genetic Programmer for Discovery of Governing Equations. (arXiv:2203.08808v1 [cs.NE])
    We develop a symbolic regression framework for extracting the governing mathematical expressions from observed data. The evolutionary approach, faiGP, is designed to leverage the properties of a function algebra that have been encoded into a grammar, providing a theoretical guarantee of universal approximation and a way to minimize bloat. In this framework, the choice of operators of the grammar may be informed by a physical theory or symmetry considerations. Since there is currently no theory that can derive the 'constants of nature', an empirical investigation on extracting these coefficients from an evolutionary process is of methodological interest. We quantify the impact of different types of regularizers, including a diversity metric adapted from studies of the transcriptome and a complexity measure, on the performance of the framework. Our implementation, which leverages neural networks and a genetic programmer, generates non-trivial symbolically equivalent expressions ("Ramanujan expressions") or approximations with potentially interesting numerical applications. To illustrate the framework, a model of ligand-receptor binding kinetics, including an account of gene regulation by transcription factors, and a model of the regulatory range of the cistrome from omics data are presented. This study has important implications on the development of data-driven methodologies for the discovery of governing equations in experimental data derived from new sensing systems and high-throughput screening technologies.  ( 2 min )
    New directions for surrogate models and differentiable programming for High Energy Physics detector simulation. (arXiv:2203.08806v1 [hep-ph])
    The computational cost for high energy physics detector simulation in future experimental facilities is going to exceed the current available resources. To overcome this challenge, new ideas on surrogate models using machine learning methods are being explored to replace computationally expensive components. Additionally, differentiable programming has been proposed as a complementary approach, providing controllable and scalable simulation routines. In this document, new and ongoing efforts for surrogate models and differential programming applied to detector simulation are discussed in the context of the 2021 Particle Physics Community Planning Exercise (`Snowmass').  ( 2 min )
    Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set. (arXiv:2203.08807v1 [eess.IV])
    Access to dermatological care is a major issue, with an estimated 3 billion people lacking access to care globally. Artificial intelligence (AI) may aid in triaging skin diseases. However, most AI models have not been rigorously assessed on images of diverse skin tones or uncommon diseases. To ascertain potential biases in algorithm performance in this context, we curated the Diverse Dermatology Images (DDI) dataset-the first publicly available, expertly curated, and pathologically confirmed image dataset with diverse skin tones. Using this dataset of 656 images, we show that state-of-the-art dermatology AI models perform substantially worse on DDI, with receiver operator curve area under the curve (ROC-AUC) dropping by 27-36 percent compared to the models' original test results. All the models performed worse on dark skin tones and uncommon diseases, which are represented in the DDI dataset. Additionally, we find that dermatologists, who typically provide visual labels for AI training and test datasets, also perform worse on images of dark skin tones and uncommon diseases compared to ground truth biopsy annotations. Finally, fine-tuning AI models on the well-characterized and diverse DDI images closed the performance gap between light and dark skin tones. Moreover, algorithms fine-tuned on diverse skin tones outperformed dermatologists on identifying malignancy on images of dark skin tones. Our findings identify important weaknesses and biases in dermatology AI that need to be addressed to ensure reliable application to diverse patients and diseases.  ( 2 min )
  • Open

    Electronic excited states in deep variational Monte Carlo. (arXiv:2203.09472v1 [physics.chem-ph])
    Obtaining accurate ground and low-lying excited states of electronic systems is crucial in a multitude of important applications. One ab initio method for solving the electronic Schr\"odinger equation that scales favorably for large systems and whose accuracy is limited only by the choice of wavefunction ansatz employed is variational quantum Monte Carlo (QMC). The recently introduced deep QMC approach, using a new class of ansatzes represented by deep neural networks, has been shown to generate nearly exact ground-state solutions for molecules containing up to a few dozen electrons, with the potential to scale to much larger systems where other highly accurate methods are not feasible. In this paper, we advance one such ansatz (PauliNet) to compute electronic excited states through a simple variational procedure. We demonstrate our method on a variety of small atoms and molecules where we consistently achieve high accuracy for low-lying states. To highlight the method's potential for larger systems, we show that for the benzene molecule, PauliNet is on par with significantly more expensive high-level electronic structure methods in terms of the excitation energy and outperforms them in terms of absolute energies.  ( 2 min )
    On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks. (arXiv:2203.09168v1 [cs.LG])
    Capturing aleatoric uncertainty is a critical part of many machine learning systems. In deep learning, a common approach to this end is to train a neural network to estimate the parameters of a heteroscedastic Gaussian distribution by maximizing the logarithm of the likelihood function under the observed data. In this work, we examine this approach and identify potential hazards associated with the use of log-likelihood in conjunction with gradient-based optimizers. First, we present a synthetic example illustrating how this approach can lead to very poor but stable parameter estimates. Second, we identify the culprit to be the log-likelihood loss, along with certain conditions that exacerbate the issue. Third, we present an alternative formulation, termed $\beta$-NLL, in which each data point's contribution to the loss is weighted by the $\beta$-exponentiated variance estimate. We show that using an appropriate $\beta$ largely mitigates the issue in our illustrative example. Fourth, we evaluate this approach on a range of domains and tasks and show that it achieves considerable improvements and performs more robustly concerning hyperparameters, both in predictive RMSE and log-likelihood criteria.  ( 2 min )
    On Redundancy and Diversity in Cell-based Neural Architecture Search. (arXiv:2203.08887v1 [stat.ML])
    Searching for the architecture cells is a dominant paradigm in NAS. However, little attention has been devoted to the analysis of the cell-based search spaces even though it is highly important for the continual development of NAS. In this work, we conduct an empirical post-hoc analysis of architectures from the popular cell-based search spaces and find that the existing search spaces contain a high degree of redundancy: the architecture performance is minimally sensitive to changes at large parts of the cells, and universally adopted designs, like the explicit search for a reduction cell, significantly increase the complexities but have very limited impact on the performance. Across architectures found by a diverse set of search strategies, we consistently find that the parts of the cells that do matter for architecture performance often follow similar and simple patterns. By explicitly constraining cells to include these patterns, randomly sampled architectures can match or even outperform the state of the art. These findings cast doubts into our ability to discover truly novel architectures in the existing cell-based search spaces, and inspire our suggestions for improvement to guide future NAS research. Code is available at https://github.com/xingchenwan/cell-based-NAS-analysis.  ( 2 min )
    An Explainable Stacked Ensemble Model for Static Route-Free Estimation of Time of Arrival. (arXiv:2203.09438v1 [cs.LG])
    To compare alternative taxi schedules and to compute them, as well as to provide insights into an upcoming taxi trip to drivers and passengers, the duration of a trip or its Estimated Time of Arrival (ETA) is predicted. To reach a high prediction precision, machine learning models for ETA are state of the art. One yet unexploited option to further increase prediction precision is to combine multiple ETA models into an ensemble. While an increase of prediction precision is likely, the main drawback is that the predictions made by such an ensemble become less transparent due to the sophisticated ensemble architecture. One option to remedy this drawback is to apply eXplainable Artificial Intelligence (XAI). The contribution of this paper is three-fold. First, we combine multiple machine learning models from our previous work for ETA into a two-level ensemble model - a stacked ensemble model - which on its own is novel; therefore, we can outperform previous state-of-the-art static route-free ETA approaches. Second, we apply existing XAI methods to explain the first- and second-level models of the ensemble. Third, we propose three joining methods for combining the first-level explanations with the second-level ones. Those joining methods enable us to explain stacked ensembles for regression tasks. An experimental evaluation shows that the ETA models correctly learned the importance of those input features driving the prediction.  ( 2 min )
    Sensitivity Estimation for Dark Matter Subhalos in Synthetic Gaia DR2 using Deep Learning. (arXiv:2203.08161v1 [astro-ph.GA] CROSS LISTED)
    The abundance of dark matter subhalos orbiting a host galaxy is a generic prediction of the cosmological framework. It is a promising way to constrain the nature of dark matter. Here we describe the challenges of detecting stars whose phase-space distribution may be perturbed by the passage of dark matter subhalos using a machine learning approach. The training data are three Milky Way-like galaxies and nine synthetic Gaia DR2 surveys derived from these. We first quantify the magnitude of the perturbations in the simulated galaxies using an anomaly detection algorithm. We also estimate the feasibility of this approach in the Gaia DR2-like catalogues by comparing the anomaly detection based approach with a supervised classification. We find that a classification algorithm optimized on about half a billion synthetic star observables exhibits mild but nonzero sensitivity. This classification-based approach is not sufficiently sensitive to pinpoint the exact locations of subhalos in the simulation, as would be expected from the very limited number of subhalos in the detectable region. The enormous size of the Gaia dataset motivates the further development of scalable and accurate computational methods that could be used to select potential regions of interest for dark matter searches to ultimately constrain the Milky Way's subhalo mass function.  ( 2 min )
    Noisy Tensor Completion via Low-rank Tensor Ring. (arXiv:2203.08857v1 [stat.ML])
    Tensor completion is a fundamental tool for incomplete data analysis, where the goal is to predict missing entries from partial observations. However, existing methods often make the explicit or implicit assumption that the observed entries are noise-free to provide a theoretical guarantee of exact recovery of missing entries, which is quite restrictive in practice. To remedy such drawbacks, this paper proposes a novel noisy tensor completion model, which complements the incompetence of existing works in handling the degeneration of high-order and noisy observations. Specifically, the tensor ring nuclear norm (TRNN) and least-squares estimator are adopted to regularize the underlying tensor and the observed entries, respectively. In addition, a non-asymptotic upper bound of estimation error is provided to depict the statistical performance of the proposed estimator. Two efficient algorithms are developed to solve the optimization problem with convergence guarantee, one of which is specially tailored to handle large-scale tensors by replacing the minimization of TRNN of the original tensor equivalently with that of a much smaller one in a heterogeneous tensor decomposition framework. Experimental results on both synthetic and real-world data demonstrate the effectiveness and efficiency of the proposed model in recovering noisy incomplete tensor data compared with state-of-the-art tensor completion models.  ( 2 min )
    Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model. (arXiv:2101.05467v3 [cs.LG] UPDATED)
    The drastic increase of data quantity often brings the severe decrease of data quality, such as incorrect label annotations, which poses a great challenge for robustly training Deep Neural Networks (DNNs). Existing learning \mbox{methods} with label noise either employ ad-hoc heuristics or restrict to specific noise assumptions. However, more general situations, such as instance-dependent label noise, have not been fully explored, as scarce studies focus on their label corruption process. By categorizing instances into confusing and unconfusing instances, this paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances. The resultant model can be realized by DNNs, where the training procedure is accomplished by employing an alternating optimization algorithm. Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness over state-of-the-art counterparts.  ( 2 min )
    Dimensionality Reduction and Wasserstein Stability for Kernel Regression. (arXiv:2203.09347v1 [stat.ML])
    In a high-dimensional regression framework, we study consequences of the naive two-step procedure where first the dimension of the input variables is reduced and second, the reduced input variables are used to predict the output variable. More specifically we combine principal component analysis (PCA) with kernel regression. In order to analyze the resulting regression errors, a novel stability result of kernel regression with respect to the Wasserstein distance is derived. This allows us to bound errors that occur when perturbed input data is used to fit a kernel function. We combine the stability result with known estimates from the literature on both principal component analysis and kernel regression to obtain convergence rates for the two-step procedure.  ( 2 min )
    Out-of-distribution Generalization in the Presence of Nuisance-Induced Spurious Correlations. (arXiv:2107.00520v4 [cs.LG] UPDATED)
    In many prediction problems, spurious correlations are induced by a changing relationship between the label and a nuisance variable that is also correlated with the covariates. For example, in classifying animals in natural images, the background, which is a nuisance, can predict the type of animal. This nuisance-label relationship does not always hold, and the performance of a model trained under one such relationship may be poor on data with a different nuisance-label relationship. To build predictive models that perform well regardless of the nuisance-label relationship, we develop Nuisance-Randomized Distillation (NURD). We introduce the nuisance-randomized distribution, a distribution where the nuisance and the label are independent. Under this distribution, we define the set of representations such that conditioning on any member, the nuisance and the label remain independent. We prove that the representations in this set always perform better than chance, while representations outside of this set may not. NURD finds a representation from this set that is most informative of the label under the nuisance-randomized distribution, and we prove that this representation achieves the highest performance regardless of the nuisance-label relationship. We evaluate NURD on several tasks including chest X-ray classification where, using non-lung patches as the nuisance, NURD produces models that predict pneumonia under strong spurious correlations.  ( 2 min )
    Bandit Labor Training. (arXiv:2006.06853v5 [cs.LG] UPDATED)
    On-demand labor platforms aim to train a skilled workforce to serve its incoming demand for jobs. Since limited jobs are available for training, and it is usually not necessary to train all workers, efficient matching of training jobs requires prioritizing fast learners over slow ones. However, the learning rates of novice workers are unknown, resulting in a tradeoff between exploration (learning the learning rates) and exploitation (training the best workers). Motivated to study this tradeoff, we analyze a novel objective within the stochastic multi-armed bandit framework. Given $K$ arms, instead of maximizing the expected total reward from $T$ pulls (the traditional "sum" objective), we consider the vector of cumulative rewards earned from the $K$ arms at the end of $T$ pulls and aim to maximize the expected highest cumulative reward (the "max" objective). When rewards represent skill increments, this corresponds to the objective of training a single highly skilled worker from a set of novice workers, using a limited supply of training jobs. For this objective, we show that any policy must incur an instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher instance-dependent constant) and a worst-case regret of $\Omega(K^{1/3}T^{2/3})$. We then design an explore-then-commit policy featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and achieves these bounds (up to logarithmic factors). We generalize our algorithmic insights to the problem of maximizing the expected value of the average cumulative reward of the top $m$ arms with the highest cumulative rewards, corresponding to the case where multiple workers must be trained. Our numerical experiments demonstrate the efficacy of our policies compared to several natural alternatives in practical parameter regimes.  ( 3 min )
    Diffusion Probabilistic Modeling for Video Generation. (arXiv:2203.09481v1 [cs.CV])
    Denoising diffusion probabilistic models are a promising new class of generative models that are competitive with GANs on perceptual metrics. In this paper, we explore their potential for sequentially generating video. Inspired by recent advances in neural video compression, we use denoising diffusion models to stochastically generate a residual to a deterministic next-frame prediction. We compare this approach to two sequential VAE and two GAN baselines on four datasets, where we test the generated frames for perceptual quality and forecasting accuracy against ground truth frames. We find significant improvements in terms of perceptual quality on all data and improvements in terms of frame forecasting for complex high-resolution videos.  ( 2 min )
    Transfer learning for cross-modal demand prediction of bike-share and public transit. (arXiv:2203.09279v1 [cs.LG])
    The urban transportation system is a combination of multiple transport modes, and the interdependencies across those modes exist. This means that the travel demand across different travel modes could be correlated as one mode may receive demand from or create demand for another mode, not to mention natural correlations between different demand time series due to general demand flow patterns across the network. It is expectable that cross-modal ripple effects become more prevalent, with Mobility as a Service. Therefore, by propagating demand data across modes, a better demand prediction could be obtained. To this end, this study explores various machine learning models and transfer learning strategies for cross-modal demand prediction. The trip data of bike-share, metro, and taxi are processed as the station-level passenger flows, and then the proposed prediction method is tested in the large-scale case studies of Nanjing and Chicago. The results suggest that prediction models with transfer learning perform better than unimodal prediction models. Furthermore, stacked Long Short-Term Memory model performs particularly well in cross-modal demand prediction. These results verify our combined method's forecasting improvement over existing benchmarks and demonstrate the good transferability for cross-modal demand prediction in multiple cities.  ( 2 min )
    Equivariant Subgraph Aggregation Networks. (arXiv:2110.02910v3 [cs.LG] UPDATED)
    Message-passing neural networks (MPNNs) are the leading architecture for deep learning on graph-structured data, in large part due to their simplicity and scalability. Unfortunately, it was shown that these architectures are limited in their expressive power. This paper proposes a novel framework called Equivariant Subgraph Aggregation Networks (ESAN) to address this issue. Our main observation is that while two graphs may not be distinguishable by an MPNN, they often contain distinguishable subgraphs. Thus, we propose to represent each graph as a set of subgraphs derived by some predefined policy, and to process it using a suitable equivariant architecture. We develop novel variants of the 1-dimensional Weisfeiler-Leman (1-WL) test for graph isomorphism, and prove lower bounds on the expressiveness of ESAN in terms of these new WL variants. We further prove that our approach increases the expressive power of both MPNNs and more expressive architectures. Moreover, we provide theoretical results that describe how design choices such as the subgraph selection policy and equivariant neural architecture affect our architecture's expressive power. To deal with the increased computational cost, we propose a subgraph sampling scheme, which can be viewed as a stochastic version of our framework. A comprehensive set of experiments on real and synthetic datasets demonstrates that our framework improves the expressive power and overall performance of popular GNN architectures.  ( 2 min )
    Stability and Risk Bounds of Iterative Hard Thresholding. (arXiv:2203.09413v1 [stat.ML])
    In this paper, we analyze the generalization performance of the Iterative Hard Thresholding (IHT) algorithm widely used for sparse recovery problems. The parameter estimation and sparsity recovery consistency of IHT has long been known in compressed sensing. From the perspective of statistical learning, another fundamental question is how well the IHT estimation would predict on unseen data. This paper makes progress towards answering this open question by introducing a novel sparse generalization theory for IHT under the notion of algorithmic stability. Our theory reveals that: 1) under natural conditions on the empirical risk function over $n$ samples of dimension $p$, IHT with sparsity level $k$ enjoys an $\mathcal{\tilde O}(n^{-1/2}\sqrt{k\log(n)\log(p)})$ rate of convergence in sparse excess risk; 2) a tighter $\mathcal{\tilde O}(n^{-1/2}\sqrt{\log(n)})$ bound can be established by imposing an additional iteration stability condition on a hypothetical IHT procedure invoked to the population risk; and 3) a fast rate of order $\mathcal{\tilde O}\left(n^{-1}k(\log^3(n)+\log(p))\right)$ can be derived for strongly convex risk function under proper strong-signal conditions. The results have been substantialized to sparse linear regression and sparse logistic regression models to demonstrate the applicability of our theory. Preliminary numerical evidence is provided to confirm our theoretical predictions.  ( 2 min )
    Kan Extensions in Data Science and Machine Learning. (arXiv:2203.09018v1 [cs.LG])
    A common problem in data science is "use this function defined over this small set to generate predictions over that larger set." Extrapolation, interpolation, statistical inference and forecasting all reduce to this problem. The Kan extension is a powerful tool in category theory that generalizes this notion. In this work we explore several applications of Kan extensions to data science. We begin by deriving a simple classification algorithm as a Kan extension and experimenting with this algorithm on real data. Next, we use the Kan extension to derive a procedure for learning clustering algorithms from labels and explore the performance of this procedure on real data. We then investigate how Kan extensions can be used to learn a general mapping from datasets of labeled examples to functions and to approximate a complex function with a simpler one.  ( 2 min )
    Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs. (arXiv:2203.09251v1 [cs.LG])
    In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts, which we believe to shed new light on this problem.  ( 2 min )
    Confidence Dimension for Deep Learning based on Hoeffding Inequality and Relative Evaluation. (arXiv:2203.09082v1 [cs.LG])
    Research on the generalization ability of deep neural networks (DNNs) has recently attracted a great deal of attention. However, due to their complex architectures and large numbers of parameters, measuring the generalization ability of specific DNN models remains an open challenge. In this paper, we propose to use multiple factors to measure and rank the relative generalization of DNNs based on a new concept of confidence dimension (CD). Furthermore, we provide a feasible framework in our CD to theoretically calculate the upper bound of generalization based on the conventional Vapnik-Chervonenk dimension (VC-dimension) and Hoeffding's inequality. Experimental results on image classification and object detection demonstrate that our CD can reflect the relative generalization ability for different DNNs. In addition to full-precision DNNs, we also analyze the generalization ability of binary neural networks (BNNs), whose generalization ability remains an unsolved problem. Our CD yields a consistent and reliable measure and ranking for both full-precision DNNs and BNNs on all the tasks.  ( 2 min )
    Near Instance Optimal Model Selection for Pure Exploration Linear Bandits. (arXiv:2109.05131v2 [stat.ML] UPDATED)
    We introduce the model selection problem in pure exploration linear bandits, where the learner needs to adapt to the instance-dependent complexity measure of the smallest hypothesis class containing the true model. We design algorithms in both fixed confidence and fixed budget settings with near instance optimal guarantees. The core of our algorithms is a new optimization problem based on experimental design that leverages the geometry of the action set to identify a near-optimal hypothesis class. Our fixed budget algorithm is developed based on a novel selection-validation procedure, which provides a new way to study the understudied fixed budget setting (even without the added challenge of model selection). We adapt our algorithms, in both fixed confidence and fixed budget settings, to problems with model misspecification.  ( 2 min )
    Pure Exploration in Kernel and Neural Bandits. (arXiv:2106.12034v2 [stat.ML] UPDATED)
    We study pure exploration in bandits, where the dimension of the feature representation can be much larger than the number of arms. To overcome the curse of dimensionality, we propose to adaptively embed the feature representation of each arm into a lower-dimensional space and carefully deal with the induced model misspecification. Our approach is conceptually very different from existing works that can either only handle low-dimensional linear bandits or passively deal with model misspecification. We showcase the application of our approach to two pure exploration settings that were previously under-studied: (1) the reward function belongs to a possibly infinite-dimensional Reproducing Kernel Hilbert Space, and (2) the reward function is nonlinear and can be approximated by neural networks. Our main results provide sample complexity guarantees that only depend on the effective dimension of the feature spaces in the kernel or neural representations. Extensive experiments conducted on both synthetic and real-world datasets demonstrate the efficacy of our methods.  ( 2 min )
    MoReL: Multi-omics Relational Learning. (arXiv:2203.08149v1 [q-bio.QM] CROSS LISTED)
    Multi-omics data analysis has the potential to discover hidden molecular interactions, revealing potential regulatory and/or signal transduction pathways for cellular processes of interest when studying life and disease systems. One of critical challenges when dealing with real-world multi-omics data is that they may manifest heterogeneous structures and data quality as often existing data may be collected from different subjects under different conditions for each type of omics data. We propose a novel deep Bayesian generative model to efficiently infer a multi-partite graph encoding molecular interactions across such heterogeneous views, using a fused Gromov-Wasserstein (FGW) regularization between latent representations of corresponding views for integrative analysis. With such an optimal transport regularization in the deep Bayesian generative model, it not only allows incorporating view-specific side information, either with graph-structured or unstructured data in different views, but also increases the model flexibility with the distribution-based regularization. This allows efficient alignment of heterogeneous latent variable distributions to derive reliable interaction predictions compared to the existing point-based graph embedding methods. Our experiments on several real-world datasets demonstrate enhanced performance of MoReL in inferring meaningful interactions compared to existing baselines.  ( 2 min )
    Counterfactual Inference of Second Opinions. (arXiv:2203.08653v1 [cs.LG] CROSS LISTED)
    Automated decision support systems that are able to infer second opinions from experts can potentially facilitate a more efficient allocation of resources; they can help decide when and from whom to seek a second opinion. In this paper, we look at the design of this type of support systems from the perspective of counterfactual inference. We focus on a multiclass classification setting and first show that, if experts make predictions on their own, the underlying causal mechanism generating their predictions needs to satisfy a desirable set invariant property. Further, we show that, for any causal mechanism satisfying this property, there exists an equivalent mechanism where the predictions by each expert are generated by independent sub-mechanisms governed by a common noise. This motivates the design of a set invariant Gumbel-Max structural causal model where the structure of the noise governing the sub-mechanisms underpinning the model depends on an intuitive notion of similarity between experts which can be estimated from data. Experiments on both synthetic and real data show that our model can be used to infer second opinions more accurately than its non-causal counterpart.  ( 2 min )
    Augmented Sliced Wasserstein Distances. (arXiv:2006.08812v7 [cs.LG] UPDATED)
    While theoretically appealing, the application of the Wasserstein distance to large-scale machine learning problems has been hampered by its prohibitive computational cost. The sliced Wasserstein distance and its variants improve the computational efficiency through the random projection, yet they suffer from low accuracy if the number of projections is not sufficiently large, because the majority of projections result in trivially small values. In this work, we propose a new family of distance metrics, called augmented sliced Wasserstein distances (ASWDs), constructed by first mapping samples to higher-dimensional hypersurfaces parameterized by neural networks. It is derived from a key observation that (random) linear projections of samples residing on these hypersurfaces would translate to much more flexible nonlinear projections in the original sample space, so they can capture complex structures of the data distribution. We show that the hypersurfaces can be optimized by gradient ascent efficiently. We provide the condition under which the ASWD is a valid metric and show that this can be obtained by an injective neural network architecture. Numerical results demonstrate that the ASWD significantly outperforms other Wasserstein variants for both synthetic and real-world problems.  ( 2 min )
    Global Convergence of MAML and Theory-Inspired Neural Architecture Search for Few-Shot Learning. (arXiv:2203.09137v1 [cs.LG])
    Model-agnostic meta-learning (MAML) and its variants have become popular approaches for few-shot learning. However, due to the non-convexity of deep neural nets (DNNs) and the bi-level formulation of MAML, the theoretical properties of MAML with DNNs remain largely unknown. In this paper, we first prove that MAML with over-parameterized DNNs is guaranteed to converge to global optima at a linear rate. Our convergence analysis indicates that MAML with over-parameterized DNNs is equivalent to kernel regression with a novel class of kernels, which we name as Meta Neural Tangent Kernels (MetaNTK). Then, we propose MetaNTK-NAS, a new training-free neural architecture search (NAS) method for few-shot learning that uses MetaNTK to rank and select architectures. Empirically, we compare our MetaNTK-NAS with previous NAS methods on two popular few-shot learning benchmarks, miniImageNet, and tieredImageNet. We show that the performance of MetaNTK-NAS is comparable or better than the state-of-the-art NAS method designed for few-shot learning while enjoying more than 100x speedup. We believe the efficiency of MetaNTK-NAS makes itself more practical for many real-world tasks.  ( 2 min )
    The Mathematics of Artificial Intelligence. (arXiv:2203.08890v1 [cs.LG])
    We currently witness the spectacular success of artificial intelligence in both science and public life. However, the development of a rigorous mathematical foundation is still at an early stage. In this survey article, which is based on an invited lecture at the International Congress of Mathematicians 2022, we will in particular focus on the current "workhorse" of artificial intelligence, namely deep neural networks. We will present the main theoretical directions along with several exemplary results and discuss key open problems.  ( 2 min )
    Adaptive Graph Auto-Encoder for General Data Clustering. (arXiv:2002.08648v5 [cs.LG] UPDATED)
    Graph-based clustering plays an important role in the clustering area. Recent studies about graph convolution neural networks have achieved impressive success on graph type data. However, in general clustering tasks, the graph structure of data does not exist such that the strategy to construct a graph is crucial for performance. Therefore, how to extend graph convolution networks into general clustering tasks is an attractive problem. In this paper, we propose a graph auto-encoder for general data clustering, which constructs the graph adaptively according to the generative perspective of graphs. The adaptive process is designed to induce the model to exploit the high-level information behind data and utilize the non-Euclidean structure sufficiently. We further design a novel mechanism with rigorous analysis to avoid the collapse caused by the adaptive construction. Via combining the generative model for network embedding and graph-based clustering, a graph auto-encoder with a novel decoder is developed such that it performs well in weighted graph used scenarios. Extensive experiments prove the superiority of our model.  ( 2 min )
    Deep Generative Models in Engineering Design: A Review. (arXiv:2110.10863v4 [cs.LG] UPDATED)
    Automated design synthesis has the potential to revolutionize the modern engineering design process and improve access to highly optimized and customized products across countless industries. Successfully adapting generative Machine Learning to design engineering may enable such automated design synthesis and is a research subject of great importance. We present a review and analysis of Deep Generative Machine Learning models in engineering design. Deep Generative Models (DGMs) typically leverage deep networks to learn from an input dataset and synthesize new designs. Recently, DGMs such as feedforward Neural Networks (NNs), Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and certain Deep Reinforcement Learning (DRL) frameworks have shown promising results in design applications like structural optimization, materials design, and shape synthesis. The prevalence of DGMs in engineering design has skyrocketed since 2016. Anticipating continued growth, we conduct a review of recent advances to benefit researchers interested in DGMs for design. We structure our review as an exposition of the algorithms, datasets, representation methods, and applications commonly used in the current literature. In particular, we discuss key works that have introduced new techniques and methods in DGMs, successfully applied DGMs to a design-related domain, or directly supported the development of DGMs through datasets or auxiliary methods. We further identify key challenges and limitations currently seen in DGMs across design fields, such as design creativity, handling constraints and objectives, and modeling both form and functional performance simultaneously. In our discussion, we identify possible solution pathways as key areas on which to target future work.  ( 2 min )
    Maximum Likelihood Estimation in Gaussian Process Regression is Ill-Posed. (arXiv:2203.09179v1 [math.ST])
    Gaussian process regression underpins countless academic and industrial applications of machine learning and statistics, with maximum likelihood estimation routinely used to select appropriate parameters for the covariance kernel. However, it remains an open problem to establish the circumstances in which maximum likelihood estimation is well-posed. That is, when the predictions of the regression model are continuous (or insensitive to small perturbations) in the training data. This article presents a rigorous proof that the maximum likelihood estimator fails to be well-posed in Hellinger distance in a scenario where the data are noiseless. The failure case occurs for any Gaussian process with a stationary covariance function whose lengthscale parameter is estimated using maximum likelihood. Although the failure of maximum likelihood estimation is informally well-known, these theoretical results appear to be the first of their kind, and suggest that well-posedness may need to be assessed post-hoc, on a case-by-case basis, when maximum likelihood estimation is used to train a Gaussian process model.  ( 2 min )
    One-Shot Adaptation of GAN in Just One CLIP. (arXiv:2203.09301v1 [cs.CV])
    There are many recent research efforts to fine-tune a pre-trained generator with a few target images to generate images of a novel domain. Unfortunately, these methods often suffer from overfitting or under-fitting when fine-tuned with a single target image. To address this, here we present a novel single-shot GAN adaptation method through unified CLIP space manipulations. Specifically, our model employs a two-step training strategy: reference image search in the source generator using a CLIP-guided latent optimization, followed by generator fine-tuning with a novel loss function that imposes CLIP space consistency between the source and adapted generators. To further improve the adapted model to produce spatially consistent samples with respect to the source generator, we also propose contrastive regularization for patchwise relationships in the CLIP space. Experimental results show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively. Furthermore, we show that our CLIP space manipulation strategy allows more effective attribute editing.  ( 2 min )
    Mixing Up Contrastive Learning: Self-Supervised Representation Learning for Time Series. (arXiv:2203.09270v1 [stat.ML])
    The lack of labeled data is a key challenge for learning useful representation from time series data. However, an unsupervised representation framework that is capable of producing high quality representations could be of great value. It is key to enabling transfer learning, which is especially beneficial for medical applications, where there is an abundance of data but labeling is costly and time consuming. We propose an unsupervised contrastive learning framework that is motivated from the perspective of label smoothing. The proposed approach uses a novel contrastive loss that naturally exploits a data augmentation scheme in which new samples are generated by mixing two data samples with a mixing component. The task in the proposed framework is to predict the mixing component, which is utilized as soft targets in the loss function. Experiments demonstrate the framework's superior performance compared to other representation learning approaches on both univariate and multivariate time series and illustrate its benefits for transfer learning for clinical time series.  ( 2 min )
    Ranking of Communities in Multiplex Spatiotemporal Models of Brain Dynamics. (arXiv:2203.09281v1 [q-bio.NC])
    As a relatively new field, network neuroscience has tended to focus on aggregate behaviours of the brain averaged over many successive experiments or over long recordings in order to construct robust brain models. These models are limited in their ability to explain dynamic state changes in the brain which occurs spontaneously as a result of normal brain function. Hidden Markov Models (HMMs) trained on neuroimaging time series data have since arisen as a method to produce dynamical models that are easy to train but can be difficult to fully parametrise or analyse. We propose an interpretation of these neural HMMs as multiplex brain state graph models we term Hidden Markov Graph Models (HMGMs). This interpretation allows for dynamic brain activity to be analysed using the full repertoire of network analysis techniques. Furthermore, we propose a general method for selecting HMM hyperparameters in the absence of external data, based on the principle of maximum entropy, and use this to select the number of layers in the multiplex model. We produce a new tool for determining important communities of brain regions using a spatiotemporal random walk-based procedure that takes advantage of the underlying Markov structure of the model. Our analysis of real multi-subject fMRI data provides new results that corroborate the modular processing hypothesis of the brain at rest as well as contributing new evidence of functional overlap between and within dynamic brain state communities. Our analysis pipeline provides a way to characterise dynamic network activity of the brain under novel behaviours or conditions.  ( 3 min )
    Training Structured Neural Networks Through Manifold Identification and Variance Reduction. (arXiv:2112.02612v2 [cs.LG] UPDATED)
    This paper proposes an algorithm (RMDA) for training neural networks (NNs) with a regularization term for promoting desired structures. RMDA does not incur computation additional to proximal SGD with momentum, and achieves variance reduction without requiring the objective function to be of the finite-sum form. Through the tool of manifold identification from nonlinear optimization, we prove that after a finite number of iterations, all iterates of RMDA possess a desired structure identical to that induced by the regularizer at the stationary point of asymptotic convergence, even in the presence of engineering tricks like data augmentation and dropout that complicate the training process. Experiments on training NNs with structured sparsity confirm that variance reduction is necessary for such an identification, and show that RMDA thus significantly outperforms existing methods for this task. For unstructured sparsity, RMDA also outperforms a state-of-the-art pruning method, validating the benefits of training structured NNs through regularization.  ( 2 min )
    Minimax Rates for High-Dimensional Random Tessellation Forests. (arXiv:2109.10541v3 [math.ST] UPDATED)
    Random forests are a popular class of algorithms used for regression and classification. The original algorithm introduced by Breiman in 2001 and many of its variants are ensembles of randomized decision trees built from axis-aligned partitions of the feature space. One such variant, called Mondrian forests, was proposed to handle the online setting and is the first class of random forests for which minimax rates were obtained in arbitrary dimension. However, the restriction to axis-aligned splits fails to capture dependencies between features, and random forests that use oblique splits have shown improved empirical performance for many tasks. In this work, we show that a large class of random forests with general split directions also achieve minimax rates in arbitrary dimension. This class includes STIT forests, a generalization of Mondrian forests to arbitrary split directions, as well as random forests derived from Poisson hyperplane tessellations. Crucially, our rates adapt to sparsity in the sense that they depend only on the dimension of the relevant feature subspace as opposed to the ambient dimension of the feature space. This generalizes the known rates for Mondrian random forests. These are the first results showing that random forest variants with oblique splits can achieve minimax rates in arbitrary dimension. Our proof technique relies on the novel application of the theory of stationary random tessellations in stochastic geometry to statistical learning theory.  ( 2 min )
    Self-Supervised Metric Learning in Multi-View Data: A Downstream Task Perspective. (arXiv:2106.07138v4 [stat.ML] UPDATED)
    Self-supervised metric learning has been a successful approach for learning a distance from an unlabeled dataset. The resulting distance is broadly useful for improving various distance-based downstream tasks, even when no information from downstream tasks is utilized in the metric learning stage. To gain insights into this approach, we develop a statistical framework to theoretically study how self-supervised metric learning can benefit downstream tasks in the context of multi-view data. Under this framework, we show that the target distance of metric learning satisfies several desired properties for the downstream tasks. On the other hand, our investigation suggests the target distance can be further improved by moderating each direction's weights. In addition, our analysis precisely characterizes the improvement by self-supervised metric learning on four commonly used downstream tasks: sample identification, two-sample testing, $k$-means clustering, and $k$-nearest neighbor classification. When the distance is estimated from an unlabeled dataset, we establish the upper bound on distance estimation's accuracy and the number of samples sufficient for downstream task improvement. Finally, numerical experiments are presented to support the theoretical results in the paper.  ( 2 min )
    Topological Graph Neural Networks. (arXiv:2102.07835v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) are a powerful architecture for tackling graph learning tasks, yet have been shown to be oblivious to eminent substructures such as cycles. We present TOGL, a novel layer that incorporates global topological information of a graph using persistent homology. TOGL can be easily integrated into any type of GNN and is strictly more expressive (in terms the Weisfeiler--Lehman graph isomorphism test) than message-passing GNNs. Augmenting GNNs with TOGL leads to improved predictive performance for graph and node classification tasks, both on synthetic data sets, which can be classified by humans using their topology but not by ordinary GNNs, and on real-world data.  ( 2 min )
    Learning Long-Term Reward Redistribution via Randomized Return Decomposition. (arXiv:2111.13485v2 [cs.LG] UPDATED)
    Many practical applications of reinforcement learning require agents to learn from sparse and delayed rewards. It challenges the ability of agents to attribute their actions to future outcomes. In this paper, we consider the problem formulation of episodic reinforcement learning with trajectory feedback. It refers to an extreme delay of reward signals, in which the agent can only obtain one reward signal at the end of each trajectory. A popular paradigm for this problem setting is learning with a designed auxiliary dense reward function, namely proxy reward, instead of sparse environmental signals. Based on this framework, this paper proposes a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning. We establish a surrogate problem by Monte-Carlo sampling that scales up least-squares-based reward redistribution to long-horizon problems. We analyze our surrogate loss function by connection with existing methods in the literature, which illustrates the algorithmic properties of our approach. In experiments, we extensively evaluate our proposed method on a variety of benchmark tasks with episodic rewards and demonstrate substantial improvement over baseline algorithms.  ( 2 min )
    Personalized Federated Learning through Local Memorization. (arXiv:2111.09360v2 [cs.LG] UPDATED)
    Federated learning allows clients to collaboratively learn statistical models while keeping their data local. Federated learning was originally used to train a unique global model to be served to all clients, but this approach might be sub-optimal when clients' local data distributions are heterogeneous. In order to tackle this limitation, recent personalized federated learning methods train a separate model for each client while still leveraging the knowledge available at other clients. In this work, we exploit the ability of deep neural networks to extract high quality vectorial representations (embeddings) from non-tabular data, e.g., images and text, to propose a personalization mechanism based on local memorization. Personalization is obtained interpolating a pre-trained global model with a $k$-nearest neighbors (kNN) model based on the shared representation provided by the global model. We provide generalization bounds for the proposed approach and we show on a suite of federated datasets that this approach achieves significantly higher accuracy and fairness than state-of-the-art methods.  ( 2 min )
    Hierarchical Clustering and Matrix Completion for the Reconstruction of World Input-Output Tables. (arXiv:2203.08819v1 [stat.ML])
    World Input-Output (I/O) matrices provide the networks of within- and cross-country economic relations. In the context of I/O analysis, the methodology adopted by national statistical offices in data collection raises the issue of obtaining reliable data in a timely fashion and it makes the reconstruction of (part of) the I/O matrices of particular interest. In this work, we propose a method combining hierarchical clustering and Matrix Completion (MC) with a LASSO-like nuclear norm penalty, to impute missing entries of a partially unknown I/O matrix. Through simulations based on synthetic matrices we study the effectiveness of the proposed method to predict missing values from both previous years data and current data related to countries similar to the one for which current data are obscured. To show the usefulness of our method, an application based on World Input-Output Database (WIOD) tables - which are an example of industry-by-industry I/O tables - is provided. Strong similarities in structure between WIOD and other I/O tables are also found, which make the proposed approach easily generalizable to them.  ( 2 min )
    Markov Decision Process modeled with Bandits for Sequential Decision Making in Linear-flow. (arXiv:2107.00204v2 [cs.LG] UPDATED)
    For marketing, we sometimes need to recommend content for multiple pages in sequence. Different from general sequential decision making process, the use cases have a simpler flow where customers per seeing recommended content on each page can only return feedback as moving forward in the process or dropping from it until a termination state. We refer to this type of problems as sequential decision making in linear--flow. We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix. At recommendation time, we use Thompson sampling (TS) to sample the transition probabilities and allocate the best series of actions with analytical solution through exact dynamic programming. The way that we formulate the problem allows us to leverage TS's efficiency in balancing exploration and exploitation and Bandit's convenience in modeling actions' incompatibility. In the simulation study, we observe the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits. We also find the proposed algorithm's performance is the most robust to changes in the across-page interdependence strength.  ( 2 min )
    Covid19 Reproduction Number: Credibility Intervals by Blockwise Proximal Monte Carlo Samplers. (arXiv:2203.09142v1 [cs.LG])
    Monitoring the Covid19 pandemic constitutes a critical societal stake that received considerable research efforts. The intensity of the pandemic on a given territory is efficiently measured by the reproduction number, quantifying the rate of growth of daily new infections. Recently, estimates for the time evolution of the reproduction number were produced using an inverse problem formulation with a nonsmooth functional minimization. While it was designed to be robust to the limited quality of the Covid19 data (outliers, missing counts), the procedure lacks the ability to output credibility interval based estimates. This remains a severe limitation for practical use in actual pandemic monitoring by epidemiologists that the present work aims to overcome by use of Monte Carlo sampling. After interpretation of the functional into a Bayesian framework, several sampling schemes are tailored to adjust the nonsmooth nature of the resulting posterior distribution. The originality of the devised algorithms stems from combining a Langevin Monte Carlo sampling scheme with Proximal operators. Performance of the new algorithms in producing relevant credibility intervals for the reproduction number estimates and denoised counts are compared. Assessment is conducted on real daily new infection counts made available by the Johns Hopkins University. The interest of the devised monitoring tools are illustrated on Covid19 data from several different countries.  ( 2 min )
    Source-Free Adaptation to Measurement Shift via Bottom-Up Feature Restoration. (arXiv:2107.05446v3 [cs.LG] UPDATED)
    Source-free domain adaptation (SFDA) aims to adapt a model trained on labelled data in a source domain to unlabelled data in a target domain without access to the source-domain data during adaptation. Existing methods for SFDA leverage entropy-minimization techniques which: (i) apply only to classification; (ii) destroy model calibration; and (iii) rely on the source model achieving a good level of feature-space class-separation in the target domain. We address these issues for a particularly pervasive type of domain shift called measurement shift which can be resolved by restoring the source features rather than extracting new ones. In particular, we propose Feature Restoration (FR) wherein we: (i) store a lightweight and flexible approximation of the feature distribution under the source data; and (ii) adapt the feature-extractor such that the approximate feature distribution under the target data realigns with that saved on the source. We additionally propose a bottom-up training scheme which boosts performance, which we call Bottom-Up Feature Restoration (BUFR). On real and synthetic data, we demonstrate that BUFR outperforms existing SFDA methods in terms of accuracy, calibration, and data efficiency, while being less reliant on the performance of the source model in the target domain.  ( 2 min )
    A Framework and Benchmark for Deep Batch Active Learning for Regression. (arXiv:2203.09410v1 [stat.ML])
    We study the performance of different pool-based Batch Mode Deep Active Learning (BMDAL) methods for regression on tabular data, focusing on methods that do not require to modify the network architecture and training. Our contributions are three-fold: First, we present a framework for constructing BMDAL methods out of kernels, kernel transformations and selection methods, showing that many of the most popular BMDAL methods fit into our framework. Second, we propose new components, leading to a new BMDAL method. Third, we introduce an open-source benchmark with 15 large tabular data sets, which we use to compare different BMDAL methods. Our benchmark results show that a combination of our novel components yields new state-of-the-art results in terms of RMSE and is computationally efficient. We provide open-source code that includes efficient implementations of all kernels, kernel transformations, and selection methods, and can be used for reproducing our results.  ( 2 min )
    Symmetry-Based Representations for Artificial and Biological General Intelligence. (arXiv:2203.09250v1 [q-bio.NC])
    Biological intelligence is remarkable in its ability to produce complex behaviour in many diverse situations through data efficient, generalisable and transferable skill acquisition. It is believed that learning "good" sensory representations is important for enabling this, however there is little agreement as to what a good representation should look like. In this review article we are going to argue that symmetry transformations are a fundamental principle that can guide our search for what makes a good representation. The idea that there exist transformations (symmetries) that affect some aspects of the system but not others, and their relationship to conserved quantities has become central in modern physics, resulting in a more unified theoretical framework and even ability to predict the existence of new particles. Recently, symmetries have started to gain prominence in machine learning too, resulting in more data efficient and generalisable algorithms that can mimic some of the complex behaviours produced by biological intelligence. Finally, first demonstrations of the importance of symmetry transformations for representation learning in the brain are starting to arise in neuroscience. Taken together, the overwhelming positive effect that symmetries bring to these disciplines suggest that they may be an important general framework that determines the structure of the universe, constrains the nature of natural tasks and consequently shapes both biological and artificial intelligence.  ( 2 min )
    Euler State Networks. (arXiv:2203.09382v1 [cs.LG])
    Inspired by the numerical solution of ordinary differential equations, in this paper we propose a novel Reservoir Computing (RC) model, called the Euler State Network (EuSN). The introduced approach makes use of forward Euler discretization and antisymmetric recurrent matrices to design reservoir dynamics that are both stable and non-dissipative by construction. Our mathematical analysis shows that the resulting model is biased towards unitary effective spectral radius and zero local Lyapunov exponents, intrinsically operating at the edge of stability. Experiments on synthetic tasks indicate the marked superiority of the proposed approach, compared to standard RC models, in tasks requiring long-term memorization skills. Furthermore, results on real-world time series classification benchmarks point out that EuSN is capable of matching (or even surpassing) the level of accuracy of trainable Recurrent Neural Networks, while allowing up to 100-fold savings in computation time and energy consumption.  ( 2 min )
    Identifiability of Sparse Causal Effects using Instrumental Variables. (arXiv:2203.09380v1 [stat.ME])
    Exogenous heterogeneity, for example, in the form of instrumental variables can help us learn a system's underlying causal structure and predict the outcome of unseen intervention experiments. In this paper, we consider linear models in which the causal effect from covariates $X$ on a response $Y$ is sparse. We prove that the causal coefficient becomes identifiable under weak conditions and may even be identified in models, where the number of instruments is as small as the number of causal parents. We also develop graphical criteria under which the identifiability holds with probability one if the edge coefficients are sampled randomly from a distribution that is absolutely continuous with respect to Lebesgue measure. As an estimator, we propose spaceIV and prove that it consistently estimates the causal effect if the model is identifiable and evaluate its performance on simulated data.  ( 2 min )

  • Open

    Building games and apps entirely through natural language using OpenAI’s code-davinci model
    submitted by /u/bperki8 [link] [comments]
    Neural Network is training to give just one output, how can I prevent this?
    Look for more info at: https://stackoverflow.com/questions/71533736/neural-network-is-training-to-give-just-one-output-how-can-i-prevent-this submitted by /u/UnityPlum [link] [comments]
    This Research Paper Explain The Compute Trends Across Three Eras Of Machine Learning
    The three essential components that determine the evolution of modern Machine Learning are computing, data, and algorithmic advancements (ML). The article looks at trends in the most easily quantifiable element. Before 2010, training computes expanded in lockstep with Moore’s law, doubling every two years. Since the early 2010s, when Deep Learning was first introduced, the rate of training compute has quickened, roughly doubling every six months. Late in 2015, a new trend emerged. The history of computation in ML has been divided into three eras based on these observations – the Pre-Deep Learning Era, the Deep Learning Era, and the Large-Scale Era. The article summarises the fast-growing compute requirements for training advanced ML systems. Trends The comparison is made on a dataset of 123 milestone ML systems, annotated with the computing it took to train them. Before Deep Learning took off, there was a period of slow progress. The tendency accelerated in 2010 and hasn’t slowed since. Separately, in 2015 and 2016, a new trend of large-scale models arose, expanding at a comparable rate but by two orders of magnitude faster than the preceding one. Continue Reading The Research Summary Paper: https://arxiv.org/pdf/2202.05924.pdf Github: https://github.com/ML-Progress/Compute-Trends submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Tokyo Researchers Hit the Lottery Ticket Theory with “Hiddenite” AI Chip - News
    submitted by /u/allaboutcircuits [link] [comments]
    ‘Searching for Escher’ - an AI journey through his work
    submitted by /u/glenniszen [link] [comments]
    Researchers From Idiap Research Institute Propose ‘HyperMixer’: An MLP-Based Green AI Alternative To Transformers
    In recent years, transformer topologies have advanced the state-of-the-art in a wide range of natural language processing (NLP) activities. Vision transformers (ViT) are now being used more frequently in computer vision. However, because of the quadratic complexity of transformers over input length, they consume a lot of energy, which limits their research and development and industrial deployment. An MLP-based Green AI Replacement to Transformers presents a novel Multi-Layer Perceptron (MLP) model, HyperMixer, as an energy-efficient alternative to transformers that preserves similar inductive biases, according to a recent article published by the Idiap Research Institute in Switzerland. The researchers demonstrate that HyperMixer may achieve performance comparable to transformers while significantly reducing processing time, training data, and hyperparameter tuning expenses. HyperMixer is a new all-MLP model with inductive biases inspired by transformers. On the GLUE benchmark, HyperMixer’s performance was compared to competitors’. HyperMixer learns attention patterns similarly to transformers, as demonstrated by this ablation. Continue Reading The Research Summary Paper: https://arxiv.org/pdf/2203.03691.pdf https://preview.redd.it/4dbukkw5d6o81.png?width=1502&format=png&auto=webp&s=52f61392525435cf105d4bfee508a4c05eef5389 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AI Capabilities for Offensive Cyber Attacks - Community Survey
    Hello, I'm working on my graduate paper regarding artificial intelligence being utilized for offensive based cyber attacks. Not just focusing on Adversarial AI, rather exploring AI use cases for malicious code generation or automated vulnerability discovery and exploitation. Would any SMEs within the community be interested in participating in a Chatham house rules interview regarding this topic? submitted by /u/Vicarfort [link] [comments]  ( 1 min )
    Difference between random and unknown variable in probability
    submitted by /u/IMPuzzled2 [link] [comments]
    What The Stanford 2022 AI Index Tells us About AI Adoption
    submitted by /u/Beautiful-Credit-868 [link] [comments]  ( 1 min )
    Researchers From The Hartree Centre, IBM, And REPROCELL Propose An Explainable Machine Learning Approach That Combines Bioinformatics And Domain Insight To Inform Precision Medicine Strategies For Inflammatory Bowel Disease
    A project supported by the STFC Hartree Centre Discovery Accelerator accurately predicts patient response to treatments for ulcerative colitis and Crohn’s disease. Artificial intelligence may soon assist more than 6 million1 individuals worldwide who suffer from inflammatory bowel disease (IBD) in selecting the optimum medication for their illness. An explainable AI pharmacogenomics methodology we created effectively predicted how patients will respond — favorably or negatively — to an IBD treatment 95% of the time, according to research published in PLOSone. Chronic inflammatory bowel diseases (IBDs) such as ulcerative colitis and Crohn’s disease are caused by clinical, genetic, and environmental variables such as nutrition and lifestyle. Even though all patients have the same symptoms, there is no one-size-fits-all treatment for IBD that is helpful for everybody. Choosing the optimum therapy for a patient is still a trial-and-error procedure for both the doctor and the patient. According to researchers at IBM Research in the UK and REPROCELL, a stem cell and fresh tissue research firm, used IBD patient data and explainable AI approaches to study treatment reactions with the help of the STFC Hartree Centre’s Discovery Accelerator. Their objective was to discover the optimum medications for IBD therapies less of a guessing game. The resulting collection of algorithms demonstrated that it was feasible to crack the IBD data black box and comprehend forecast and explain how persons with IBD could react to different medications on the market and under development. Continue Reading Our Research Summary Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0263248 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Artificial Nightmares : Night Elf Forrest || Clip Guided Disco Diffusion AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
  • Open

    [D] Looking for a GPU server for our university research lab
    Our lab is recently doing more deep learning projects and our current lab server is struggling with the increasing load. My professor tasked me with finding our lab a good GPU server for the lab. The reserved budget is 25K. Our current server have the following specifications: CPU: Intel Xeon Silver 4114 CPU RAM: 192GB GPU: NVIDIA Tesla V100 PCIe 32GB (one GPU) There are 4 phd students currently working on the server. I prefer something better than the current server. So far I found Lambda Labs and ThinkMate (based on a friend recommendation). My professor want me to select a vendor with good support and offering academic discounts. I'm looking for GPUs that are perform similar or better than Nvidia RTX 2080 Ti. Do you have a recommendation for any vendor that builds GPU servers (~$25K, servicing 3-5 students working mostly on CNN and RNN models), and which server configurations is best? Thank you. submitted by /u/majax21 [link] [comments]  ( 3 min )
    [D] Overview SOTA Results of NN on Various Datasets
    Hi everybody, I developed a new NN training approach and want to benchmark it now. I am looking for Classification/Regression tasks on various normal and image datasets, but also autoencoder-like reconstruction problems (I guess denoising is here the classic evaluation task?) Are there any good websites/papers which collect current SOTA results on multiple datasets? It would be great if they also provide training times, number of parameters, etc. What datasets would you recommend to definitely include in such a benchmark? Thanks in advance! submitted by /u/arcxtriy [link] [comments]  ( 1 min )
    [Research] Assessing stability of cluster label assignment over cross-validation folds
    Hi everyone, I have a dataset of continuous multivariate samples. I'm clustering these samples together using an unsupervised algorithm to infer unmeasured latent states. Ideally, this would just be performed on the entire sample set. However, this label will then subsequently form part of a supervised prediction algorithm, and because the dataset isn't very large (several hundred samples), I'm almost certainly going to end up using k-fold cross validation. ​ Now, the issue with using cluster labels generated from the whole dataset is that this essentially permits data leakage from the test fold back into the training fold for the supervised learning algorithm (which, for the purposes of my experiment, I don't want). So instead, the unsupervised labels will need to be generated for each fold individually. ​ What I would like to do is have a measure of the relative stability/instability of label assignment for each sample across the folds. So for example, over 10 folds, ideally sample 1 would always get the label 'A' ten times out of ten... but in reality, perhaps it will get label 'A' eight times, and label 'B' twice. ​ It sounds like the Bhattacharyya coefficient might be an appropriate scoring system (i.e. for each sample, calculate the square root of the product of the label proportions, and sum over all samples) - but I've never used this before, so I don't know whether it would be an appropriate use case. Does anyone have any familiarity with either this measure or this type of situation in general? submitted by /u/mcflyanddie [link] [comments]  ( 1 min )
    [D] Democratization of large (language) models
    Hi all, I've been thinking about how to democratize running and training of large language models, and more generally how to democratize all large models. I've wondered if this would be achievable with a volunteer computing project. I've found two (and only two, unfortunately) papers on the subject, so I at least know that I'm not the only one thinking about this. Those papers can be found here and here. That would be the end of the discussion for me, but I think the previous approaches have some flaws. Both papers use data parallelism, making training of extremely large (> 20GB) models basically impossible, because very few volunteers will have enough resources to run, much less train, the models, regardless of batch size. Systems like Hydra might be able to run things locally for models between 20 to 250GB, but above that it becomes a question of disk space. A more theoretical flaw is that both of the papers use a modified version of SGD, which means any model would be required to use special optimizers. I don't know how that would impact results/convergence, but it doesn't seem trivial. In conclusion, it seems to me that a VC system for extremely large models probably needs a novel approach that uses tensor parallelism, so that work could be more granularly distributed. It's (probably) not an easy problem, but it seems possible. I'm curious what this community's thoughts are, and if anyone would also be interested in working on this. Also, if anyone knows any other resources on this problem, please drop a link so I can check them out! submitted by /u/Lazauya [link] [comments]  ( 1 min )
    [D] Can I submit to ARR for resubmission and at the same time commit to a workshop
    In the 2 links below, it says you can commit to conferences such as ACL 2022 and NAACL 2022 at the same time submit to ARR for resubmission, but it doesn't say if you can do the same for workshops or conferences relating to ARR other than the 2 conferences listed above. In the last link, it pretty much specifies two conferences (maybe just listed as an example?) that you can commit and at the same time submit to ARR for resubmission, but for some reason doesn't directly say that you can commit to ARR related conference/workshop alongside submitting to ARR for resubmission. So if anybody has any info on whether I can submit to ARR for resubmission and at the same time commit to any ARR related workshop/conference, I'd appreciate it. https://www.2022.aclweb.org/post/acl-2022-chair-blog-post-faq https://aclrollingreview.org/choices/ submitted by /u/SiegeMemeLord [link] [comments]  ( 1 min )
    [D] Layer normalization in transformer blocks.
    I was playing with the idea of trying a transformer for a sequence modelling task for the first time recently - I was interested in all the X-former variants interested in better training stability, eg. Catformer or rezero or adding gates to the residual connections. I found I always got late NaN losses whenever the transformer blocks contained layer normalization. Removing layer normalization or using catformer (which has no layer normalization) this didn't happen. I was curious about if anyone knows why this is the case, and what current ideas are about the necessity of layer normalization to the transformer. submitted by /u/WigglyHypersurface [link] [comments]  ( 1 min )
    [D] Choosing EC2 instance with two goals in mind
    I’m running train tests on data that is quite large (1.5M, 400). I need to preprocess such data with a data transformation and model-based data imputation pipeline. Then I need to run train/test fits on a neural net that needs PyTorch and GPU. With this in mind, what is the most cost-effective aws ec2 instance for large data processing that has gpu as well? Right now I’m either running it all on a p3.2xlarge, or I use a c5d.9xlarge for the data preprocessing, save that file and then move back to p3.2xlarge to run the train/test. Second option is not ideal because I’d want to run several tests quickly even when changing around feature engineering or features selected. submitted by /u/Gioamorim80 [link] [comments]  ( 1 min )
    [R] New paper on autonomous driving and multi-task: "HybridNets: End-to-End Perception Network"
    Hello, we are publishing our first paper as undergraduate students today. It is achieving SOTA on the BDD100K dataset (2 out of 3 tasks, at least). ​ Paper: https://arxiv.org/abs/2203.09035 Code: https://github.com/datvuthanh/HybridNets ​ Network architecture: HybridNets architecture ​ Contributions: HybridNets, an end-to-end perception network, achieving outstanding results in real-time on the BDD100K dataset for 3 tasks: traffic object detection, drivable area segmentation (not SOTA), and lane line detection. Automatically customized anchor for each level in the weighted bidirectional feature network, on any dataset. An efficient training loss function and training strategy to balance and optimize multi-task networks. submitted by /u/xoiga123 [link] [comments]  ( 1 min )
    [R] Deep Batch Active Learning for Regression
    (First author here.) Are you interested in improving the sample-efficiency of neural network regression through active learning? Then, our new paper might be all you need 🙂 Paper: https://arxiv.org/abs/2203.09410 Code: https://github.com/dholzmueller/bmdal_reg Short summary: Using random projections of NN gradients as features and running a simple clustering method on these works really well, is scalable and convenient to use through our code. Long summary: We study pool-based batch mode deep active learning (BMDAL) for regression: We start with a small labeled training data set and repeatedly select a batch of unlabeled data from a large pool set for labeling. We want to select large batches since NN training can be slow. ​ Deep Batch Active Learning loop We propose a benchmark wi…  ( 3 min )
    [D] How does publishing in ML+Healthcare papers work
    I have started my PhD in application of machine learning in health care. I want to understand what it takes to publish in papers like Journal of the American Medical Informatics Association (JAMIA) . Are there any patterns in the publication that one can look out for? Do they always need the best state of the art results or a novel algorithm? submitted by /u/Complex_State9960 [link] [comments]  ( 2 min )
    [D] Best labeling tool for action localization from video?
    I would like to discuss what people their experiences are with labeling tools for action localization in videos. Currently I only found CVAT to be somewhat decent (but still pretty bad). Besides this there are some Github repos that publish a (not/semi) working solution, but they also don't do the trick for me. My biggest issue with CVAT is that I can only upload 1 video per task and that it is limited to 500MB. I would like to know what other tools people have used and what your experiences are? submitted by /u/FreddyShrimp [link] [comments]  ( 1 min )
    [D] Paper Review - Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments (Video Walkthrough)
    https://youtu.be/O_dJ31T01i8 Catastrophic forgetting is a big problem in mutli-task and continual learning. Gradients of different objectives tend to conflict, and new tasks tend to override past knowledge. In biological neural networks, each neuron carries a complex network of dendrites that mitigate such forgetting by recognizing the context of an input signal. This paper introduces Active Dendrites, which carries over the principle of context-sensitive gating by dendrites into the deep learning world. Various experiments show the benefit in combatting catastrophic forgetting, while preserving sparsity and limited parameter counts. ​ OUTLINE: 0:00 - Introduction 1:20 - Paper Overview 3:15 - Catastrophic forgetting in continuous and multi-task learning 9:30 - Dendrites in biological neurons 16:55 - Sparse representations in biology 18:35 - Active dendrites in deep learning 34:15 - Experiments on multi-task learning 39:00 - Experiments in continual learning and adaptive prototyping 49:20 - Analyzing the inner workings of the algorithm 53:30 - Is this the same as just training a larger network? 59:15 - How does this relate to attention mechanisms? 1:02:55 - Final thoughts and comments ​ Paper: https://arxiv.org/abs/2201.00042 Blog: https://numenta.com/blog/2021/11/08/can-active-dendrites-mitigate-catastrophic-forgetting submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [D] Mac Studio as a computer for machine learning and data science
    I'm curious what people think of Apple's new Mac Studio as a potential workstation for machine learning and data processing tasks? It seems like Tensorflow can makes use of the M1 chip, but not Pytorch yet - I don't know how these compare to traditional GPU setups though? More generally, do you think the Mac Studio would be a good dev machine and good for other data-based work? I'm not sure, as it seems like the Studio is being pitched for people doing things like video editing rather than coding. Thanks! submitted by /u/uio8 [link] [comments]  ( 2 min )
    [N] Nordic Probabilistic AI School (ProbAI) — June 13-17, 2022
    You are welcome to apply for the Nordic Probabilistic AI School (ProbAI) 2022 being held on June 13-17 in Helsinki (Finland). APPLY NOW — The application deadline is March 27. About ProbAI 2022 The mission of the third Nordic Probabilistic AI School (ProbAI) is to provide an inclusive education environment serving state-of-the-art expertise in machine learning and artificial intelligence. The public, students, academia and industry are welcome to join ProbAI 2022. ProbAI is an intermediate to advanced level "summer" school with the focus on probabilistic machine learning. Covered are topics such as probabilistic models, variational approximations, deep generative models, latent variable models, normalizing flows, neural ODEs, probabilistic programming, and much more. The ProbAI 2022 w…  ( 3 min )
    [D] How did you decide a research topic?
    Those of you who did/doing a PhD recently, how did you decide your PhD topic? My advisor seems to give me a lot of freedom, by letting me choose my own topic, but has said if it is away from his interests he can't help a lot. He also would apply for funding, once I have a clear topic, which adds even more pressure. He wants me to write a survey paper, so that both of us can know the literature in the field. I've tried looking at the research by AI companies, but most of the areas are completely crowded. Which brings me to my question, how do you go about choosing your topic? Was a broad idea given to you by your advisor? Were you also expected to know everything about your field? How did you have the confidence in selecting a topic, given most of recent papers is just randomly adding/shifting/removing layers (no offense). Also, what did you expect/get from your advisor? Given my situation, it seems like I'll have to run the show, which makes me anxious. submitted by /u/Bibbidi_Babbidi_Boo [link] [comments]  ( 7 min )
  • Open

    [R] New paper on autonomous driving and multi-task: "HybridNets: End-to-End Perception Network"
    submitted by /u/xoiga123 [link] [comments]  ( 1 min )
  • Open

    Enable conversational chatbots for telephony using Amazon Lex and the Amazon Chime SDK
    Conversational AI can deliver powerful, automated, interactive experiences through voice and text. Amazon Lex is a service that combines automatic speech recognition and natural language understanding technologies, so you can build these sophisticated conversational experiences. A common application of conversational AI is found in contact centers: self-service virtual agents. We’re excited to announce that you […]  ( 9 min )
  • Open

    AI-generated utopias
    AI isn't known for being able to solve the big problems, but what about the VERY large problems, such as possible futures to strive for? I decided to find out if I could get GPT-3 to come up with new ideas for utopias. Since GPT-3 works by predicting  ( 3 min )
    Bonus: Utopias that make no sense
    AI Weirdness: the strange side of machine learning  ( 1 min )
  • Open

    Multi-Head Attention
    Examining a module consisting of several attention layers running in parallel.  ( 4 min )
  • Open

    Hopped Up: NVIDIA CEO, AI Leaders to Discuss Next Wave of AI at GTC
    NVIDIA’s GTC conference is packed with smart people and programming. The virtual gathering — which takes place from March 21-24 — sits at the intersection of some of the fastest-moving technologies of our time. It features a lineup of speakers from every corner of industry, academia and research who are ready to paint a high-definition Read article > The post Hopped Up: NVIDIA CEO, AI Leaders to Discuss Next Wave of AI at GTC appeared first on NVIDIA Blog.  ( 3 min )

  • Open

    [D] On the difference (or lack thereof) between Cross-Entropy Loss and KL-Divergence
    So cross-entropy(H(p,q)) and KL-divergence (KL(p||q)) relate to each other as follows: H(p,q) = KL(p||q) + H(p) and KL(p||q) = H(p,q) - H(p) where p is the data distribution and q is the model distribution. When p is constant (as is the case in most ML problems), minimizing H(p,q) is equivalent to minimizing KL(p||q). However, there seems to be some ambiguity about this. (One practitioner claims)[https://stats.stackexchange.com/a/409271] that there is a difference in practice, because during batch gradient descent the data distribution p' in each batch is noisy and harder learn for the model, leading to worse performance for the KL-divergence. I am skeptical about his claim, as H(p) is part of both cross-entropy and KL-divergence, depending on how one views them. If anything, the KL-divergence should work better because it does not directly incorporate H(p). What is your experience / your thoughts? submitted by /u/optimized-adam [link] [comments]  ( 1 min )
    Machine learning - Fawkes - facial recognition [discussion]
    Hello to everyone, At the moment exist any way to avoid facial recognition from social media? Not Fawkes.... Another way because I think AI learned that the picture is modified with this software. It is possible that the software recognize this pictures as fake face? Also combine fawkes with another software as Photoshop could be a good idea to look as another user? In social media the people use Photoshop... I don't know if is still possible to recognize easy Thank you submitted by /u/Maleficent_Camel5718 [link] [comments]  ( 1 min )
    [P] Extracting "potential user profiles" from firewall logs
    Hi everyone. For a project i need to find a way to extract some "potential user profiles" from firewall logs, such as the pic i attach to this post. For example i would like to infer that user A belongs to the "administrative" part of the company whyle Bob belongs to the tecnician department. Any hints on how to do this? I taught about using k-means on some certain fields such as destination, source, protocol. The problem is that k-means flags different records of the same user with different clusters, and i guess this may not be a good strategy (I might try to use some kind of smoothening in which i flatten the cluster, for example if BoB has labels 0 0 0 0 0 1 1 1 1 0 0 0 0 0 then bob has cluster 0). ​ https://preview.redd.it/tehf5axof0o81.png?width=1489&format=png&auto=webp&s=720ab6f792e2d409735e662d602e76ac7909e23b submitted by /u/Set-New [link] [comments]  ( 1 min )
    [R] Domain-specific pre-training of GPT? Help!
    I am looking to adapt GPT-2 to generate dialogue utterances in the style of a certain demography/population (let's call them Ogres). However, there are no large datasets that are both 1) dialogue datasets, and 2) are generated by this target demography. In the absence of such data, I have been considering a few approaches for data augmentation purposes. Many of those approaches would benefit from a GPT-Ogre, which is at least capable of generating text similar to Ogres, if not necessarily dialogic. Approach 1 ========== For this, I am considering performing additional pre-training of, say, GPT-2 on some medium-sized corpora generated by Ogres. This sounds like something that should have been done by a lot of people for a lot of different things by now, but except for some papers that have tried to do this with BERT in the Medical domain, I was not able to find any papers/GitHub repos that have done this with additional unsupervised pre-training GPT. It would be helpful if someone could point me to some resources around this as I feel the space of hyperparameters to figure out the best learning rate, etc. is too large, and if somebody has already done this, it would be easy to replicate it. Approach 2 ========== There are some dialogue-specific GPT models such as DialoGPT that have been fine-tuned (in a supervised way; mind you, not pretrained in an unsupervised way). However, it is not in the Ogre style. I am wondering if it's a ridiculous idea to perform additional pre-training of a fine-tuned GPT-2 model? submitted by /u/exceptaway343 [link] [comments]  ( 1 min )
    [R] Restoring and attributing ancient texts using deep neural networks
    Ancient history relies on disciplines such as epigraphy — the study of inscribed texts known as inscriptions — for evidence of the thought, language, society and history of past civilizations. However, over the centuries, many inscriptions have been damaged to the point of illegibility, transported far from their original location and their date of writing is steeped in uncertainty. This work presents Ithaca, a deep neural network for the textual restoration, geographical attribution and chronological attribution of ancient Greek inscriptions. Paper: https://www.nature.com/articles/s41586-022-04448-z Code: https://github.com/deepmind/ithaca Online interface: https://ithaca.deepmind.com/ Blog: https://deepmind.com/blog/article/Predicting-the-past-with-Ithaca Video: https://www.youtube.com/watch?v=rq0Ex_qCKeQ submitted by /u/yannisassael [link] [comments]  ( 1 min )
    [Discussion] PyTorch Lightning vs DeepSpeed vs FFCV vs mosaic vs ... what are they and how can you use them?
    Deepspeed? FSDP? FFCV? XYZK? what do they all mean and how can you use all of them to speed up your model training? All amazing techniques developed by world-class teams and are (or are being made) accessible via PyTorch Lightning! If you know of other techniques you want to be integrated, please comment below! https://william-falcon.medium.com/pytorch-lightning-vs-deepspeed-vs-fsdp-vs-ffcv-vs-e0d6b2a95719 submitted by /u/waf04 [link] [comments]  ( 1 min )
    [Discussion] Is clean real-world data important for model training?
    Hi reddit experts, I know that for any machine learning and AI application, you need data in order to train your model. I would like to ask the following: Where does this data usually come from? (For instance, is this data provided by the customer? Your professor? Is it always assumed that this data will be available? Can you trust the source?) If you don’t already have this data, will you be asked to implement a system to collect this data? Would the accuracy of this data be important to the model? Meaning, let’s say if we wanted to train the model of an AI temperature controller. Would it really matter if the data was just 25 degrees? Versus 25.58 degrees? Is it important that this data is sampled fast enough? Meaning, if I only had 1 data point per day, versus 10 data points per minute? Here’s some context to why I’m asking. I’ve been a measurement and instrumentation engineer for a good part of my career. We set up measurement systems that allow customers to collect all kinds of real-world data, including temp, vibration, voltage, images, etc. The other day I had a debate with a friend who does AI, where he claimed that “collecting the data" was a trivial effort, because most of time he just uses the data that’s been given to him. But I challenged him, because there's lots of know-how needed in acquiring real-world physical data (for ex: which sensors to use, how to avoid noise, how to synchronize timestamps, etc.) All of this know-how ensures that the collected data is accurate, clean, and complete. But he just kinda brushed this off, and said that most ML/AI algorithms nowadays just look at the general trend of data, absolute accuracy doesn’t really matter, but there still needs to be some relative variability, then he can work with it. I’m having a hard time believing this is true. Can anyone shed some light on this subject? submitted by /u/sharkera130 [link] [comments]  ( 2 min )
    [Research] A great resource for Risk-Averse RL!
    https://www.cs.unh.edu/~mpetrik/tutorials/risk/ submitted by /u/Ok_Can2425 [link] [comments]  ( 1 min )
    GPU's and Memory [D]
    Hello! just made my first gan and am keen to try make one in 1024x1024 I am looking to upgrade my gpu processing power by adding a RTX A4000 16g to my existing RTX 2080 8g will the memory stack as I have been working around some cuda memory allocation issues so far its working but Id like to know before I put down that kind of money will I be able to go past a 8g limit if I add the A4000 to my rig I have tried googling this question to no clear answer, feel free to link something related to this. submitted by /u/Trainsmurf [link] [comments]  ( 2 min )
    [Discussion] Anyone know some Machine Learning Games Similar to Wobbledogs or Creatures 3?
    I am obsessed with machine learning despite not being very smart at it myself. But specifically, I love it in games. I love interacting with it. I'm looking for more games like these: I play NovelAI, Creatures 3, Wobbledogs, AI Dungeon, Replika, and I was going to pick up Species: Artificial Life Real Evolution. Someone made a hook of what I think is GPT 2.7B to Crusader Kings so that you could talk to the different countries' leaders. Wobbledogs uses it for dog's walking. Motivation and desire are present. It also has positive / negative reinforcement. AI Dungeon, Replika, and NovelAI are all GPT - OpenAI GPT Da Vinci (for AID Dragon), OpenAI GPT 2.0 (of some sort, for Replika), and Fairseq / EleutherAI GPT models (for NovelAI). Creatures 3 was one of the first pseudo(?) neural network games and utilizes motivation, desire, and positive / negative reinforcement. Species: Artificial Life Real Evolution is an abandoned game which simulated lifeforms evolving in stressful situations. They don't seem to actually be neural network, but it's close enough to be very interesting - creatures have motivations and desires but do not experience positive / negative reinforcement. There's also 60,000 of them at once... I guess it's sort of also got generational adaptation! PS, if anyone wants to obsess over these awesome games with me, let's do it! submitted by /u/WiIdCherryPepsi [link] [comments]  ( 1 min )
    [R] New paper on Tabular DL: "On Embeddings for Numerical Features in Tabular Deep Learning"
    Hi! We introduce our new paper "On Embeddings for Numerical Features in Tabular Deep Learning". Paper: https://arxiv.org/abs/2203.05556 Code: https://github.com/Yura52/tabular-dl-num-embeddings TL;DR: using embeddings for numerical features (i.e. using vector representations instead of scalar values) can lead to significant profit for tabular DL models. Let's consider the vanilla MLP taking two numerical inputs. https://preview.redd.it/yb55tdw27wn81.png?width=330&format=png&auto=webp&s=a6fc53e8611baee6993aab47480f0a6a6b85e46c Now, here is the same MLP, but now with embeddings for numerical features: https://preview.redd.it/zebl8tld7wn81.png?width=368&format=png&auto=webp&s=3d20652075d0543c7d6c70f34d67140bc2c6346b The main contributions: we show that using vector representations instead of scalar representations for numerical features can lead to significant profit for tabular DL models we show that MLP-like models equipped with embeddings can perform on par with Transformer-based models we make some progress in the "DL vs GBDT" competition submitted by /u/Yura52 [link] [comments]  ( 2 min )
    [P] Unsupervised Feature Selection using Supervised Algorithms
    FRUFS is a Feature Relevance based Unsupervised Feature Selection approach using supervised algorithms such as XGBoost, etc. This is a new algorithm that I developed and put down in the form of a blog. Here's the link to the article. While reading the blog you will also re-discover an influential research paper in this domain on your own! FRUFS was evaluated on 4 different datasets under both unsupervised and supervised settings. The results are available on the blog as well as in the table below. Dataset Task All Features FRUFS Metric MNIST Unsupervised 50.48 53.70 NMI Waveform Unsupervised 38.20 39.67 NMI Ionosphere Supervised 88.01 91.45 Accuracy Adult Supervised 62.16 62.65 F1-score ​ Fig. 1 shows the feature importance of FRUFS (left) in comparison to the actual feature importance (right) on the MNIST dataset. This was done without any labels. Fig. 1: Visualizing feature/pixel importance of MNIST dataset ​ I have also developed a scikit-learn compatible library in case you want to use FRUFS for your own project. Here's the github repo. A simple pip install FRUFS will do the trick! Do give the blog a read and let me know your thoughts. submitted by /u/atif_hassan [link] [comments]  ( 1 min )
  • Open

    Last Week in AI: Deepfakes for Politicians, AI Propaganda, Operation Safety Net, Understanding Pigs
    submitted by /u/regalalgorithm [link] [comments]
    In A Latest Machine Learning Research, NVIDIA Researchers Propose A Novel Critically-Damped Langevin Diffusion (CLD) For Score-Based Generative Modeling
    A promising family of generative models has emerged: score-based generative models (SGMs) and denoising diffusion probabilistic models. SGMs have applications in image, voice, and music synthesis, image editing, super-resolution, image-to-image translation, and 3D shape generation because they provide high-quality synthesis and sample variety without requiring adversarial aims. SGMs use a diffusion process to progressively introduce noise to the data, changing a complicated data distribution into a tractable prior distribution for analysis. The modified data’s score function—the gradient of the log probability density—is then learned using a neural network. To synthesize new samples, the learned scores can be used to solve a stochastic differential equation (SDE). Inverting the forward diffusion corresponds to an iterative denoising process. Continue Reading Paper: https://arxiv.org/pdf/2112.07068.pdf Project: https://nv-tlabs.github.io/CLD-SGM/ Code: https://github.com/nv-tlabs/CLD-SGM ​ https://i.redd.it/b2wxorotuzn81.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Semantic StyleGAN - A novel approach for face generation and editing
    submitted by /u/imapurplemango [link] [comments]
    AI-modified short story study
    Hi all! Anybody interested in reading an AI-modified story? I created a user study where a short story of about 3000 words has been modified with NLP, and one of the various versions is shown to the reader. The system gives you a very short personality test, and asks you a few questions on what you thought of the story. There are Amazon vouchers worth 5 GBP available for those doing this now, in any country! People found not to have done the study carefully enough might be excluded. It's at https://cci.arts.ac.uk/~wnybom/cloak.html submitted by /u/wnybom [link] [comments]  ( 1 min )
    Is AI intrinsically hyped?
    submitted by /u/bendee983 [link] [comments]  ( 1 min )
    Agent types vs Algorithms
    Hi, I recently enrolled for AI in a local college, I find AI as a whole fascinating. One problem that I've run into is, the concept of user agents. From my understanding, user agents are the entities that try to obtain a goal, such as a human playing chess, here the human would be the user agent. The humans sensors would be their eyes and brain. The humans actuators would be their hands. Similarly an autonomous taxi could be a user agent, the taxis sensors could be cameras, speedometer, etc. It's actuators could be steering, breaks, horn, signals etc. If you can tell by the latter example above, I'm currently reading AI: A modern approach(3rd edition). My question is; I've learned about some basic search algorithms such as DFS and BFS. I've implemented a DFS algorithm to solve a maze ie get to the exit node. What is my user agent in this program? Is the user agent the algorithm itself(DFS)?? Or do I have to create a class and put the algorithm as a function(I'm using Python). Thanks :) submitted by /u/Adam20188 [link] [comments]  ( 1 min )
    Multilingual AI: how to perform text processing/text generation in non-English languages
    Hello, Text processing AI has made great progress these last years but the main focus is on the English language (understandably). I think that many people are trying to do Natural Language Processing in non-English languages but are disappointed by the results. It is especially hard with text generation models like GPT-3, GPT-J, GPT-NeoX... In this article, I'm trying to quickly summarize what the options are today for people trying to use a multilingual AI: https://nlpcloud.io/multilingual-nlp-how-to-perform-nlp-in-non-english-languages.html If you can think of additional solutions not mentioned in this article please let me know! submitted by /u/juliensalinas [link] [comments]  ( 1 min )
    Microsoft Introduces ‘PeopleLens’: An Open-Ended Artificial Intelligence System That Uses Computer Vision Algorithms To Help Young People Who Are Blind To Engage With Their Immediate Social Surroundings
    Social engagement can be challenging for children who are born blind. Despite a great desire to do so, many blind children and young people with low vision fail to engage and befriend individuals in their age group. This can be extremely difficult for the kid or adolescent, as well as their support network of family members and instructors who wish to assist them in making these crucial connections. Microsoft team has recently developed “PeopleLens,” an open-ended AI system that provides additional resources to people who are blind or have low vision to make sense of and interact with their local social settings. The system uses Nreal Light augmented reality glasses connected to a smartphone, allowing them to expand their existing talents and abilities. Quick Read: https://www.marktechpost.com/2022/03/16/microsoft-introduces-peoplelens-an-open-ended-artificial-intelligence-system-that-uses-computer-vision-algorithms-to-help-young-people-who-are-blind-to-engage-with-their-immediate-social-surround/ Paper: https://www.microsoft.com/en-us/research/uploads/prod/2021/06/Morrison-Interactions2021\_PeopleLens.pdf Microsoft Blog: https://www.microsoft.com/en-us/research/blog/peoplelens-using-ai-to-support-social-interaction-between-children-who-are-blind-and-their-peers/ https://preview.redd.it/wvol1y7nnvn81.png?width=1542&format=png&auto=webp&s=d95e16d28797ce795634902cf02a125a05839312 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    New Project
    I created a new project, and am currently looking for people willing to help: https://botbox.dev/ice-dragon-ai/ it's a free AI art generator If you are willing to help, either email me (you can find it in the link) or DM me on reddit submitted by /u/Recent_Coffee_2551 [link] [comments]
    Artificial Nightmares : Call of Cthulhu || Clip Guided Disco Diffusion AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
  • Open

    In A Latest Machine Learning Research, NVIDIA Researchers Propose A Novel Critically-Damped Langevin Diffusion (CLD) For Score-Based Generative Modeling
    A promising family of generative models has emerged: score-based generative models (SGMs) and denoising diffusion probabilistic models. SGMs have applications in image, voice, and music synthesis, image editing, super-resolution, image-to-image translation, and 3D shape generation because they provide high-quality synthesis and sample variety without requiring adversarial aims. SGMs use a diffusion process to progressively introduce noise to the data, changing a complicated data distribution into a tractable prior distribution for analysis. The modified data’s score function—the gradient of the log probability density—is then learned using a neural network. To synthesize new samples, the learned scores can be used to solve a stochastic differential equation (SDE). Inverting the forward diffusion corresponds to an iterative denoising process. Continue Reading Paper: https://arxiv.org/pdf/2112.07068.pdf Project: https://nv-tlabs.github.io/CLD-SGM/ Code: https://github.com/nv-tlabs/CLD-SGM ​ https://i.redd.it/8dl9ftuquzn81.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Neural network for keystroke biometrics
    Hi all, I apologize for the lack of technical knowledge here, as I have a biology background and am so very much out of my wheelhouse here. I was wondering if it was possible to just get a neural network that has already been trained to analyze keystroke biometric data and build profiles off of that information. We are having participants type out a sentence from Alice in Wonderland with a character limit of 100 characters. Background: I am a forensic science masters student with a biology background. My masters program has a cybercrime course where we have to conduct a research project with a digital forensic focus, and we've had very little guidance thus far. My project group decided to do a keystroke biometric comparison between English typing vs foreign language typing to see if both profiles could be tied back to one individual (eg, does switching between languages when typing affect typing manner and thus change the keystroke profile of one individual). Unfortunately, this has become a much more technical project than anticipated, and , as I mentioned, NONE of us has any real digital background. Our lecturer just sort of left us hanging and said "cool, go build and train your own neural network for this project", while my group has no idea HOW to actually do that. Thank you so much for any and all help! As I said, I deeply apologize for my lack of knowledge in this subject, and hope my post here is adequate. submitted by /u/determinedkorra [link] [comments]  ( 1 min )
  • Open

    Build a traceable, custom, multi-format document parsing pipeline with Amazon Textract
    Organizational forms serve as a primary business tool across industries—from financial services, to healthcare, and more. Consider, for example, tax filing forms in the tax management industry, where new forms come out each year with largely the same information. AWS customers across sectors need to process and store information in forms as part of their […]  ( 6 min )
  • Open

    Offline Optimization for Architecting Hardware Accelerators
    Posted by Amir Yazdanbakhsh, Research Scientist and Aviral Kumar, Student Researcher, Google Research Advances in machine learning (ML) often come with advances in hardware and computing systems. For example, the growth of ML-based approaches in solving various problems in vision and language has led to the development of application-specific hardware accelerators (e.g., Google TPUs and Edge TPUs). While promising, standard procedures for designing accelerators customized towards a target application require manual effort to devise a reasonably accurate simulator of hardware, followed by performing many time-intensive simulations to optimize the desired objective (e.g., optimizing for low power usage or latency when running a particular application). This involves identifying the right ba…  ( 9 min )
  • Open

    3 Questions: How the MIT mini cheetah learns to run
    CSAIL scientists came up with a learning pipeline for the four-legged robot that learns to run entirely by trial and error in simulation.  ( 5 min )
    Handheld surgical robot can help stem fatal blood loss
    The AI-Guided Ultrasound Intervention Device is a lifesaving technology that helps a range of users deliver complex medical interventions at the point of injury.  ( 7 min )
  • Open

    Benchmarking my neural network implementation.
    Hi, I programmed a neural network "framework" in C++ which uses CUDA. It's my own implementation, and it's probably really slow, since I'm a beginner and don't even know how oop works (it's all one big file). The project is more about understanding the math behind it, but I still want to know the algorithm's performance in relation to some widely used frameworks like TensorFlow or Keras. It does manage to keep the GPU at 80 - 90% usage for a batch sizes of around 25-50 and 32-64 neurons per layer, if that says anything. However, is there something like "calculated derivatives per second", as a measurement of the speed of the implementation? That would be really nice to know. My implementation updates the network around 13 times per second with a batch size of 50 and a topology of { 6,32,32,6 }. How slow is that? I tried to train the network to calculate the acceleration present in a 3 body planet system. It struggles to do so, although the adam optimizer really helped. I think it got the first digit of the acceleration close to consistently right after one night of training, but it seemed, as if the learning speed was accelerating. The topology was { 6,64,64,6 }, batch size 50 and the learning rate 0.025. English is not my native language, so please forgive my mistakes. submitted by /u/qwedp [link] [comments]  ( 1 min )
    First visit and Every visit estimate of value of non terminal state
    Hi, I was going through the below question picked from reinforcement learning book (sutton & barto) ​ Consider an MDP with a single non terminal state and a single action that transitions back to the non-terminal state with probability p and transitions to the terminal state with probability 1-p. Let the reward be +1 on all transitions and let gamma = 1 (discount factor). Suppose you observe one episode that last for 10 steps, with a return of 10. What are the first-visit and every visit estimators of the value of the non terminal state. ​ Hi, I am not looking for the answer but the steps that take you to the answer. I am beginner in Reinforcement learning. Please help me. Thank you. submitted by /u/Tricky-Jello-7847 [link] [comments]  ( 1 min )
    Has anybody used RecSim NG from Google for online recommender system simulation?
    I was wondering if RecSim NG is actively used for research. Seems like there are limited resources and demos to play with. submitted by /u/Few-Cap-5989 [link] [comments]  ( 1 min )
    self-supervised RL implementation
    Hi, I reimplemented Behavior From the Void: Unsupervised Active Pre-Training(official implementation https://github.com/rll-research/url_benchmark) Because, their APT only supports vector-based environment and ICM-based implementation(that is different version from APT paper). So I reimplemented it as paper described and replemented APT shows very interesting at first(showed significant behaviors and received 20~30 episode reward constantly and deceased apt encoder error) but at actual transfer step, It shows very bad performance. Q value looks overestimated when training Q network in pretraining step. Orange : scratch DrQ, gray : Finetuned version after APT 2M frames training ​ Orange : scratch DrQ, gray : Finetuned version after APT 2M frames training What should I do training APT correctly? My implementation is based on DrQ and APT official implementation https://github.com/seolhokim/APT Really Thanks for reading. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 1 min )
    "A Review of the Gumbel-max Trick and its Extensions for Discrete Stochasticity in Machine Learning", Hujiben et al 2021
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Policy improvement by planning with Gumbel", Danihelka et al 2021 {DM} (Gumbel AlphaZero/Gumbel MuZero)
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    7 Best Free Datacamp Course to become a Data Scientist
    These are my favorite free Datacamp courses to learn in-demand data skills like Python, SQL, Power BI, Tableau, Seaborn, Matplotlib, Data…  ( 5 min )
  • Open

    Everyone’s a PC Gamer This GFN Thursday
    It’s never been easier to be a PC gamer. GeForce NOW is your gateway into PC gaming. With the power of NVIDIA GeForce GPUs in the cloud, any gamer can stream titles from the top digital games stores — even on low-powered hardware. Evolve to the PC gaming ranks this GFN Thursday and get ready Read article > The post Everyone’s a PC Gamer This GFN Thursday appeared first on NVIDIA Blog.  ( 3 min )

  • Open

    Artificial Intelligence in the News (March 16th, 2022)
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    NHS Introduces A New AI-Based Technology That Can Detect Heart Disease At Record Speed And With 40 Percent Higher Accuracy
    The NHS is now employing a cutting-edge AI program that can diagnose heart illness in just 20 SECONDS. Experts claim that the computer tool replicates human abilities but with more precision. When the patient is in the scanner, it analyses cardiac MRI images in about 20 seconds. This is far faster than a human doctor would take, which may take up to 13 minutes. It can also identify changes in the heart’s anatomy with a 40 percent higher accuracy. While the patient is in the scanner, the computer tool, which resembles human ability but with more precision and speed, can analyze cardiac MRI data in 20 seconds. Continue Reading The Research Summary https://preview.redd.it/rg4nmzlqptn81.jpg?width=6000&format=pjpg&auto=webp&s=c8f99ff5479cf9d5dfe6184073cc86664bbbba17 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    LIME
    what are the best resources for studying LIME models in XAI? also what are the research gaps in this area? submitted by /u/Aromatic-Ad-2235 [link] [comments]
    US Chamber Launches Commission on Artificial Intelligence to Advance U.S. Leadership
    submitted by /u/HotMomentumStocks [link] [comments]
    The media has often times portrayed AI in a bad light, what are the positive applications of AI which could help us exist better in the near future that it hasn't done already?
    Hello! This is my first post here so sorry if I don't phrase my topic of discussion well. For context, I'm doing a conceptual project about creating a more positive narrative on our future with AI focusing on one aspect of how it can benefit us in our daily lives. It will be an illustrated story to tell said narrative, not so much a research document (not smart enough for that). The question basically is, what are your thoughts about the different ways and areas AI could improve our lives in which it hasn't done so already? Where is AI lacking today? What could it do more of? Ideas can be grounded such as AI robots serving as caretakers for the elderly in their homes or more sci fi-like such as having AI colonize Mars for us instead of humans. Go wild! Would love to hear your thoughts! submitted by /u/Current-Development5 [link] [comments]  ( 1 min )
    [P] Composer: a new PyTorch library to train models ~2-4x faster with better algorithms
    submitted by /u/moinnadeem [link] [comments]  ( 1 min )
    Last Week in AI: AI to detect problematic gambling, police surveillance of protestors and journalists, low-cost way to tune large AI models, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    “Adaptive Valley” visuals generated with Disco Diffusion
    submitted by /u/Im_Will_Smith [link] [comments]
    Will robots take our jobs?
    According to the World Economic Forum's "The Future of Jobs Report 2020", AI is expected to replace 85 million jobs worldwide by 2025. So if AI gets smarter and replace our jobs, we don’t have to work anymore but we can get whatever we need for a minimum standard of living without paying money? submitted by /u/Dayoshiime18 [link] [comments]  ( 2 min )
    17 Best Datacamp Courses 2022 Data Science, Python, R, ML, SQL -
    submitted by /u/sivasiriyapureddy [link] [comments]
    MIT researchers have demonstrated the use of a generative machine-learning model to create synthetic data, based on real data, that can be used to train another model for image classification.
    submitted by /u/qptbook [link] [comments]  ( 1 min )
    A mini-conversation with Joe Rogan's AI persona
    submitted by /u/kuasha7 [link] [comments]
    Hi fellow AI enthusiasts/professionals! We need YOUR help to improve our AI community!
    If you are passionate about AI, please consider joining our community and come exchange with us! We are a community around AI called "Learn AI Together". I think the name says it all, but more specifically, we are a community on Discord with ~22'000 members looking for more professionals to help others, but also for anyone interested in AI that is willing to connect with people like-minded. We share job offers, interesting projects, events. You can have discussions regarding pretty much anything related to AI with people that will be just as excited as you are! I think it's a fantastic place to be if you are in the field or simply interested in it! I will also be looking for more contributors and moderators to help us if you are interested in that role, if so, please DM me on discord after joining the server, I'd be happy to chat! If that sounds interesting, please join here: https://www.discord.gg/learnaitogether submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    Researchers From ByteDance Introduce MetaFormer: A Unified Meta Framework for Fine-Grained Recognition That Achieves 92.3% and 92.7% on CUB-200-2011 and NABirds
    Fine-grained visual classification, in contrast to generic object classification, tries to correctly classify things from the same basic category (birds, vehicles, etc.) into subcategories. Because of the modest inter-class variances and high intra-class variations, Fine-Grained Visual Classification (FGVC) has long been regarded as a difficult assignment. The most common FGVC techniques, such as the part-based model and the attention-based model, are primarily focused on how to get the network to focus on the most discriminative regions. The inductive bias of localization is introduced to neural networks with elaborate structure, inspired by human observation behavior. Furthermore, when some species are virtually indistinguishable, human specialists frequently rely on information other …  ( 1 min )
    Are Neuropsychologists useful in the AI industry?
    I'm very passionate in intelligence and AI, but not too passionate in coding. I'm asking if there's any jobs out there in the AI industry that would require knowledge of the human mind, rather than programming, as that would be the ultimate career for me to get into. Any suggestions are appreciated. submitted by /u/Sinful0ne [link] [comments]  ( 2 min )
    Ever wondered what you'll look like in different dresses without ever changing? This AI model generates photo realistic bodies of you in so many different ways!
    submitted by /u/MLtinkerer [link] [comments]  ( 1 min )
    Artificial Intelligence is Taking on Parkinson's Disease
    submitted by /u/Beautiful-Credit-868 [link] [comments]
  • Open

    [D] Do you think CXL and PCIe 5.0 will be a game changer for ML?
    https://venturebeat.com/2021/11/15/astera-labs-announces-memory-acceleration-to-clear-datacenter-ai-ml-bottlenecks/ Do you think this technology would allow the use of clusters to handle larger data sets, thus reducing the overall cost of doing ML? submitted by /u/sawine [link] [comments]  ( 1 min )
    [D] How does an Model-Agnostic Meta-Learning model know what tasks needs to perform?
    I was reading the paper Model-Agnostic Meta-Learning and that part wasn't specified, or at least I couldn't find it. More specifically, a model is trained to perform n different tasks. At inference, how do people typically inform the model what tasks need to be performed? is it as simple as adding an extra input with the task to be performed, or do people use more sophisticated ways to do that? For example, the paper has an example where they train an NN on sines with different amplitudes and phases. In that case, the inputs are all identical, but the outputs shouldn't be., which in the paper they aren't. submitted by /u/carlml [link] [comments]  ( 1 min )
    [R] LIME and explainable AI
    I am currently doing some research about explainable AI and especially LIME framework.. are there any useful resources i can start with ?? also what are possible research gaps in that area? submitted by /u/Aromatic-Ad-2235 [link] [comments]  ( 1 min )
    [N] ICPR 2022 ODeuropa Competition on Olfactory Object Recognition (ODOR)
    Call for participation in the ICPR 2022 ODeuropa Competition on Olfactory Object Recognition (ODOR)! Task: Smell-related object detection in artworks It's the world’s first competition for the detection on olfactory objects on historical artworks. Work with a data set of >24000 object annotations in 87 categories on ~3000 images and create innovative solutions to detect a wide range of objects in the challenging domain of artworks. Smell-related object detection in artworks; image source: https://www.mauritshuis.nl/en/our-collection/artworks/742-as-the-old-sing-so-pipe-the-young/ Please visit https://odor-challenge.odeuropa.eu for further details on how to participate! submitted by /u/vchristlein [link] [comments]  ( 1 min )
    [D] How do you measure the business impact of the ML solutions in your company?
    For those who have deployed models to production, I was wondering what are some ways to track the business impact of a model/solution once deployed. I haven't come across a lot of information on tracking the performance of models against business/user metrics. If you use analytics data, what type of data do you collect? And how do you A/B test against different models? Is there some tooling available that could help with that? Any help is appreciated! :) submitted by /u/nathaliamdc [link] [comments]  ( 1 min )
    [D] Vertical first, horizontal second. Why you should break through to production early when developing machine learning systems and how MLOps facilitates this.
    https://medium.com/p/306fa7b7a80b I believe a common misconception is that you only need to apply MLOps principles and tools if you are running hundreds of models. I'd argue it's not less important in a lot earlier stages of the model lifecycle. submitted by /u/stiebels [link] [comments]  ( 1 min )
    [N] Live and open training of BigScience's 176B multilingual language model has just started
    The [BigScience project](https://bigscience.huggingface.co) has just started the training of its main model and the training can be followed live here: https://twitter.com/BigScienceLLM and here: https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard#scalars&tagFilter=loss Here are more information on the model, dataset, engineering, training and hardware: The model: 176B parameters decoder-only architecture (GPT-like) 70 layers - 112 attention heads per layers - hidden dimensionality of 14336 - 2048 tokens sequence length ALiBi positional embeddings - GeLU activation function Read more: Blog post summarizing how the architecture, size, shape, and pre-training duration where selected: https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-…  ( 2 min )
    [P] Composer: a new PyTorch library to train models ~2-4x faster with better algorithms
    Hey all! We're excited to release Composer (https://github.com/mosaicml/composer), an open-source library to speed up training of deep learning models by integrating better algorithms into the training process! Time and cost reductions across multiple model families Composer lets you train: A ResNet-101 to 78.1% accuracy on ImageNet in 1 hour and 30 minutes ($49 on AWS), 3.5x faster and 71% cheaper than the baseline. A ResNet-50 to 76.51% accuracy on ImageNet in 1 hour and 14 minutes ($40 on AWS), 2.9x faster and 65% cheaper than the baseline. A GPT-2 to a perplexity of 24.11 on OpenWebText in 4 hours and 27 minutes ($145 on AWS), 1.7x faster and 43% cheaper than the baseline. https://preview.redd.it/0bitody9qrn81.png?width=10008&format=png&auto=webp&s=d9ecdb45f6419eb49e1c2c69ee…  ( 5 min )
    [P] INTELlinext: A LSTM and HMM-Based Solution for Next-App Prediction (collaboration with Intel)
    Hello everyone, I'm excited to share our work with Intel over the past months for our undergraduate capstone. Since September, we've worked with Intel's Data Collection & Analysis team on building models for application preload to minimize loading times. INTELlinext: A Fully Integrated LSTM and HMM-Based Solution for Next-App Prediction With Intel SUR SDK Data Collection Our talk at the UCSD Capstone series More on the development of the data collection framework: Development of Input Libraries With Intel XLSDK to Capture Data for App Start Prediction submitted by /u/cgorlla [link] [comments]  ( 1 min )
    [P] [R] Database of real life face images in 1024x1024 categorised by skin type for StyleGAN2 FID comparisons
    Hi! I am searching for a database of 500 images of each of these 6 skin types: asian, black, indian, hispanic/latino, middle eastern and white. I want to comparison some 100k images that I have produced with StyleGAN2 to verify the quality difference of skin types and want some good quality images to compare my data towards. Thank you for any advice of where I can find such. A db with good quality face images in 1024x1024 that isnt categorized with different skin types is also appreciated as I can use deepface to ceparate them submitted by /u/oSunde [link] [comments]  ( 1 min )
    [D] What is the current consensus on the effectiveness of Active Learning?
    There is a lot of research going on in Active Learning but I feel there is nothing conclusive coming out of this field. A lot of methods are struggling to beat the random sampling baseline. For example, they used to publish this promising paper of using drop out for uncertainty estimation: https://arxiv.org/pdf/1506.02142.pdf But people tried to use it and it was just not performing well. Since then, I am not sure if DL methods can properly bootstrap themselves and identify what they do not know. I feel people are just urged to publish papers and it is too easy to fake the numbers and reportedly beating baselines until the next guy tries to reproduce the results. Is my skepticism justified? Or are there some methods (links appreciated) that are productively used in the industry? Edit: To narrow the discussion, i mean active learning in the sense of finding the next samples to label to maximize improvement of the current model performance. submitted by /u/KonArtist01 [link] [comments]  ( 4 min )
    [P] Explanation video of Open Retrieval Question Answering (ORQA)
    In this video, I discuss ORQA which uses a retriever to find the right context from the entire Wikipedia and then uses an extractive QA model to give a final answer. We discuss the task setup, architecture, and loss function. The video is part of 8 video series on Open domain question answering, how it is different from normal QA, the difference in loss formulations, and key papers on different Open-QA architectures. I will really appreciate any feedback. https://www.youtube.com/watch?v=9bL2VbwZ9G8 submitted by /u/infiniteakashe [link] [comments]  ( 1 min )
    [D] IJCAI 2022 Rebuttal Discussion
    This is a discussion panel for IJCAI 2022 reviews. submitted by /u/errohan400 [link] [comments]  ( 2 min )
    [D] Are cross-validation and other traditional error analytics tools enough for real-world machine learning?
    The standard error analysis tools like cross-validation, ROC curve, confusion matrix are always discussed in machine learning university courses. However, based on my experience in the real-world project, these approaches are not enough to decide whether a model is good or not? For example, these tools do not answer on which segment of data the model performs poorly or in which part the model confidence is low? Is there any standard approach or tool that can answer these questions and make machine learning error analysis more in-depth? submitted by /u/RiseHappy5641 [link] [comments]  ( 2 min )
    [R][P] Driving an AE86 with factorized optical flow maps as mid-level representations
    Hi MachineLearning, I would like to introduce a new concept of utilizing factorized optical flow maps as mid-level representations, for bridging the perception and the control modules in modular learning based robotic frameworks. In the below video, we demonstrate the DRL agent is able to control itself by perceiving the factorized optical flow maps, and without bumping into the pedestrians in the urban environment based on Unity. Hope you like the idea and enjoy the video! The screenshot from the demo video Demo video: https://youtu.be/Op4QRTJOGMY More details here: https://arxiv.org/abs/2203.04927 submitted by /u/Kanahei [link] [comments]  ( 1 min )
    [D] Deep Neural Nets: 33 years ago and 33 years from now - by Andrej Karpathy Dir. of AI at Tesla
    In a blog post Andrej Karpathy Director of AI at Tesla remakes the LeCun 1989 experiment, which was the earliest application of a neural net trained with backpropagation. Then he improves that result using 33 years of progress in deep learning. Finally, he tries to extrapolate the result in 33 years from now, so in 2055. I let you guess what he concluded... submitted by /u/ClaudeCoulombe [link] [comments]  ( 4 min )
    [Discussion] Your worst bug in ML code?
    What are the common types of bugs that are the hardest to catch in ML code(while writing a model) vs. (while re-using a model and writing pre-, post-processing code)? Tell me about your worst bug. Also, lmk if you would want to vent to me over a voice chat maybe. Im trying to understand a little more about the common bugs. submitted by /u/CarbonCosma [link] [comments]  ( 1 min )
  • Open

    Finally an official MuZero implementation
    deepmind/mctx: Monte Carlo tree search in JAX (github.com) submitted by /u/jack281291 [link] [comments]  ( 1 min )
    What is a technically principled way to compare new RL architectures that have different capacity, ruling out all possibile confounding factors?
    I have four RL agents with different architectures whose performance I would like to test. My question, however, is: how do you know whether performance of a specific architecture is better because the architecture is actually better at OOD generalization (in case you're testing that) or because it simply has more neural networks and greater capacity? submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    How many trajectories to learn from in policy gradient and actor-critic methods?
    In the policy gradient and actor-critic methods, we collect trajectories and then update the policy network. So, how many trajectories should we collect before training the networks. Also, is there any range of learning rate we should choose from while training these network because too high learning rate can cause the problem of exploding gradients and then the networks just gives nan as the output. submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
    Hybrid Learning - Meaning and Benefits for Your Classroom
    submitted by /u/your_kompanions [link] [comments]
    Best workflow for creating javascript games + agents?
    Hi, does anyone have experience making javascript agents + games? When I was working on reinforcement learning agents for games like 2048 before, i would build the game/simulation with pygame, and then also code the agent in python. This worked well, but I would like to have the game with the agent playable by others when I'm done in browser. SO i am wondering what is the best workflow for getting a game + agent going in javascript? Is there a specific game engine/library you would recommend for javascript that is equivalent to pygame? I only know of https://github.com/photonstorm/phaser which seems to be the most popular. Also, I have not done any DEEP reinforcement learning, only simple Q learning with Q table. I would like to implement DQN for a game like Uno, but in javascript how would I train the neural net itself in JS? Do people train in tensorflow (python) and then convert to tensorflow.js somehow? This part I am really unsure of and would appreciate any guidance on best path to take. Thankyou! submitted by /u/TernaryJimbo [link] [comments]
  • Open

    Amazon SageMaker JumpStart models and algorithms now available via API
    In December 2020, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions that […]  ( 9 min )
  • Open

    How artificial intelligence can help combat systemic racism
    MLK Visiting Professor S. Craig Watkins looks beyond algorithm bias to an AI future where models more effectively deal with systemic inequality.  ( 5 min )
    Learning to fly
    Veteran and PhD student Andrea Henshall has used MIT Open Learning to soar from the Air Force to multiple aeronautics degrees.  ( 6 min )
  • Open

    Dividing an octave into 14 pieces
    Keenan Pepper left a comment on my previous post saying that the DTMF tones used by touch tone phones “are actually quite close to 14 equal divisions of the octave (rather than the usual 12).” Let’s show that this is right using a little Python. import numpy as np freq = np.array([697, 770, 852, 941, […] Dividing an octave into 14 pieces first appeared on John D. Cook.  ( 1 min )
  • Open

    Hybrid Quantum Algorithms for Quantum Monte Carlo
    Posted by William J. Huggins, Research Scientist, Google Quantum AI The intersection between the computational difficulty and practical importance of quantum chemistry challenges run on quantum computers has long been a focus for Google Quantum AI. We’ve experimentally simulated simple models of chemical bonding, high-temperature superconductivity, nanowires, and even exotic phases of matter such as time crystals on our Sycamore quantum processors. We’ve also developed algorithms suitable for the error-corrected quantum computers we aim to build, including the world’s most efficient algorithm for large-scale quantum computations of chemistry (in the usual way of formulating the problem) and a pioneering approach that allows for us to solve the same problem at an extremely high spatial res…  ( 8 min )
  • Open

    Data-centric AI: Practical implications with the SMART Pipeline
    For context, I am a cofounder of Encord, a company building software to improve training data for computer vision.  ( 6 min )
  • Open

    Explanation video of how to do Open-QA using ORQA formulation
    Hi, in this video, I explain ORQA which uses a retriever to find the right context from the entire Wikipedia and then uses an extractive QA model to give a final answer. We discuss the task setup, architecture, and loss function. The video is part of 8 video series on Open domain question answering, how it is different from normal QA, the difference in loss formulations, and key papers on different Open-QA architectures. I will really appreciate any feedback. Thanks. https://www.youtube.com/watch?v=9bL2VbwZ9G8 submitted by /u/infiniteakashe [link] [comments]  ( 1 min )

  • Open

    [Discussion] : Innovative ways to share ML based work protected by a NDA (happy to hear advice for ML engineers going through mid-career crisis)
    I am a machine learning engineer (initially started as an android engineer who built ML based features along with creating the application and then completely went into mainstream ML based systems) who worked on several pocs/mvps at several startups during the course of my 4-5 year career. There have been around 8-9 pocs I have built during this time. Don't mean to sound boastful at all but in most of them I was the sole contributor and came up with the product idea, did the required analysis/experimentation, developed/tested/sometimes made a lot of changes to existing models to make them work, built several recommendation systems and 2-3 ML based systems. Some of these ML based systems weren't straightforward and took quite some time to optimise and achieve results which would have a mean…  ( 3 min )
    [N] Tutorial on Conformal Prediction and DFUQ Part 2: Conditional Coverage + Diagnostics
    Hey everyone! We just posted Part 2 of our Tutorial on Conformal Prediction and Distribution-Free Uncertainty Quantification on YouTube! https://youtu.be/TRx4a2u-j7M It focuses on conditional coverage and diagnostics to make sure your conformal procedure is working properly. It's slightly more advanced than the last one, but will leave you with a strong understanding of how to implement/evaluate conformal in code. Let us know if you have any feedback by shooting me an email :) Best, Anastasios submitted by /u/aangelopoulos [link] [comments]  ( 1 min )
    [D] Impact factor of NeurIPS and other conferences ;
    For those of us who publish both at ML conferences and in traditional academic journals, it would be incredibly useful to know the impact factor of NeurIPS and other top ‏‏‎ ML conferences. In many non-CS departments, these conferences are still relatively unknown, and so being able to report these numbers internally would be very helpful, particularly to satisfy deans during hiring reviews. While ISI does not officially estimate the impact factor for conferences, this Microsoft page suggests that the impact factor of NeurIPS has a steady state value ~30 - 40. Does anyone have a secondary source for this, or another estimate? I am well aware of the many flaws of using impact factor to estimate the "importance" of a paper. Unfortunately, my university (and I would imagine many others) are stuck with it. Thank you! submitted by /u/LittleCathyXO [link] [comments]  ( 1 min )
    [D] Making Deep Learning Go Brrrr From First Principles
    Folks often want their models to run faster. But... researchers often end up cargo culting performance tricks without understanding the underlying principles. To help address that, I wrote a blog called "Making Deep Learning Go Brrrr From First Principles": https://horace.io/brrr_intro.html Basically, for most models, there are 3 regimes that you might be spending all of your time on - Compute, Memory-Bandwidth, and Overhead. (If we wanted to be exhaustive, we could also include data-loading (i.e. Disk Bandwidth) and distributed calls (i.e. network bandwidth)). Figuring out which one you're bottlenecked by is crucial if you want to spend your time on actually speeding up your model and not trying out random stuff :P Hope folks find it useful - happy to clarify/get any feedback here. submitted by /u/programmerChilli [link] [comments]  ( 1 min )
    [Project] Interested in Reinforcement learning? Read how DoorDash used Thompson Sampling to optimize our promotional messages to Dashers
    When working on a promotion optimization problem there are a lot of different multi-armed bandit algorithms to choose from. In a recent DoorDash project I worked on I did a technology survey of three reinforcement learning algorithms :Epsilon-Greedy, The Upper Confidence Bound, and Thompson Sampling. I wanted to share my experience and why we ultimately decided to go with Thompson sampling. Let me know what you think, and take a look at this post to get the technical details submitted by /u/Outrageous_Pilot_351 [link] [comments]  ( 1 min )
    [D][P] Next steps for testing a new learning technique
    I'm working on a new learning technique in pytorch and have gotten some promising results. It doesn't always work, but when it does it gives a small boost in accuracy at a cost of network size and speed. It seems to be correlated with the ratio of dataset size to original network size. I can use it on a small two layer model on MNIST and boost accuracy from 98% to 99.5%. And I've been able to get it to work on pytorch's baseline 18 layer Resnet on ImageNet. However, I can't do more serious without paying thousands of dollars for AWS testing. I'm looking for any of the following: ​ Suggestions on any large datasets where the current state of the art architecture and training code are both open source and also can be run on a single GPU. Suggestions for applications where a small boost in accuracy would be worth a slowdown and network size isn't a problem. Connections with anyone who works for a company in one of those fields who might be open to hearing my pitch and testing it with me. submitted by /u/ronthebear [link] [comments]  ( 2 min )
    [Announcement] HuggingFace BigScience AMA Thursday, March 24th from 5pm CET
    Come ask questions about everything on the new HuggingFace BigScience language model, the dataset, the licences, the cluster! BigScience have just started training a massive-scale multilingual language model on the French supercomputer Jean Zay with BigScience – literally out in the open. This is not only the first time a multilingual LLM (46 languages!) at this scale will be fully accessible to the ML research community, but the whole decision, engineering and training process is transparent and open. We'll be training for several months and the community can follow along, engage and ask questions, etc. The Reddit AMA will be with some of our chairs and leading engineers and research scientists like Stas Bekman (Hugging Face), Iz Beltagy (AI2), Julien Launay (LightOn) and many more. The model, compute and training 176B parameters – 70 layers, 112 attention heads 384 A100 80GB GPUs– on Jean Zay Checkpoint size: only the bf16 weights are 329GB, the full checkpoint with optimizer states is 2.3TB Training throughput: about 150 TFLOPs Estimated training time: 3-4 months (depending on throughput and unexpected events) More info Model architecture and a blog post on decisions on architecture, size, shape, and pretraining duration Tensorboard during the training Details on the obstacles overcome during the preparation on the engineering side (instabilities, optimization of training throughput, many technical challenges and questions) submitted by /u/cavedave [link] [comments]  ( 1 min )
    [R] Watching the Watchmen: Pitfalls in Graph Generative Model Evaluation
    Graph generative models are a hot topic in ML, with new models being proposed regularly. Their goal: synthesise new graphs from a learned distribution of input graphs. While coming up with a new model may be hard, it turns out that a proper comparison of individual models is even harder! In this blog post, I briefly summarise our recent ICLR 2022 spotlight paper on evaluating such models. Spoiler alert: kernels make a surprising comeback, but the devil, as always, is in the details. Let me know what you think! submitted by /u/Pseudomanifold [link] [comments]  ( 1 min )
    [D] Are DDPMs a variation on Score Based Generative Modeling? Or is there a fundemental difference between the two?
    Are DDPMs a variation on Score Based Generative Modeling? Or is there a fundemental difference between the two? submitted by /u/ondrea_luciduma [link] [comments]  ( 2 min )
    [N] Flower Summit 2022 Announcement
    Save the Date! The federated learning experts and the Flower community are coming together to share the latest research results on the field of federated learning and present their recent use case scenarios at the Flower Summit 2022 on the 31st of May 2022. The summit will in addition contain some hands-on workshops to give you all the knowledge and know-how to find out how Flower accelerates the development of systems in both research and production scenarios. The conference will take place in Cambridge and online that every data science and machine learning enthusiast has the chance to attend the summit. Block your calendar and register for the conference here: https://flower.dev/conf/flower-summit-2022/ If you are working on federated learning and want to present your research results or use cases, you have now the chance to send us your presentation abstract via the Call for Speaker option. submitted by /u/burnai [link] [comments]  ( 1 min )
    [R] Recent Advances in Efficient and Scalable Graph Neural Networks
    Hello r/MachineLearning! I would like to share a research blogpost overviewing the toolbox enabling Graph Neural Networks to scale to real-world graphs and real-time applications. "Recent Advances in Efficient and Scalable Graph Neural Networks" Blogpost: https://www.chaitjo.com/post/efficient-gnns/ Twitter Thread: https://twitter.com/chaitjo/status/1503668741443362817 Training and deploying GNNs to handle real-world graph data poses several theoretical and engineering challenges: Giant Graphs – Memory Limitations Sparse Computations – Hardware Limitations Graph Subsampling – Reliability Limitations The blogpost introduces three simple but effective ideas in the 'toolbox' for developing efficient and scalable GNNs: Data Preparation - From sampling large-scale graphs to CPU-GPU hybrid training via historical node embedding lookups. Efficient Architectures - Graph-augmented MLPs for scaling to giant networks, and efficient graph convolution designs for real-time inference on batches of graph data. Learning Paradigms - Combining Quantization Aware Training (low precision model weights and activations) with Knowledge Distillation (improving efficient GNNs using expressive teacher models) for maximizing inference latency as well as performance. submitted by /u/chaitjo [link] [comments]  ( 2 min )
    [D] Recurrent Neural Networks for Multivariate Time Series with Missing Values
    Hi, I have just published my latest medium article. Time series are everywhere, in every industry from Energy to Geoscience, etc. Therefore, it is crucial to work on them; In most cases (especially in real-world projects), time-series datasets contain numerous missing data points which are highly connected to the output of prediction. This article gives you a review of the existed methods and then a thorough illustration of GRU-D (based on Gated Recurrent Unit) to deal with missing points. I also, added the code of the model construction for better understanding or utilizing in your possible project. This method is certainly useful in most real-world cases because it is usual to have some missing points in gathering and collecting your time-series data. Though the proposed model is for classification, I believe that we can utilize it for regression tasks as well. Please share this article with those who might find it interesting or useful in their works. Finally, let me know your impression or your feedback.😉 here is the link: https://rezayazdanfar.medium.com/recurrent-neural-networks-for-multivariate-time-series-with-missing-values-38c2bc91dd45 submitted by /u/rezayazdanfar [link] [comments]  ( 1 min )
    What is the future of AI in biology and medicine? [D]
    So I know that AI has been used to predict people’s biological age, and I recently read an article where researchers used machine learning to train AI to find mitophagy inducing compounds that could potentially be used in Alzheimer’s treatment. I find the multifaceted nature or machine learning fascinating. So, how long do you think it will take until we have AI capable of identifying safe and effective compounds that can be used to treat cancer? I would love if anyone could link any relevant articles or studies pertaining to this question so that I can explore this topic more:) submitted by /u/tylstem [link] [comments]  ( 1 min )
    [N] Join 2022 Vancouver Women in Data Science Annual Conference (Virtual) March 18 & 25
    Holaaa! 2022 WiDS(Women in Data Science) Vancouver Annual Conference is happening this Friday (march 18) as well as next Friday (march 25) virtually. The topic this year is Data Science in Production with 4 sub topics. Speakers come from Shopify, AWS, EA, VEERUM and more. The two-day conference ticket is 15 CAD and students get a 20% discount! Here's the schedule of the conference as well as speaker highlights! And you can purchase your ticket over here: https://www.eventbrite.ca/e/2022-vancouver-women-in-data-science-conference-tickets-261599981587 (I'm part of the organizing committee (DataCan, datacan.network. If you wanna know more about DataCan, message me or comment below!) ​ https://preview.redd.it/e62agjhmahn81.png?width=2381&format=png&auto=webp&s=b3c1e5fc72b6f7e9925e945e2cf2a3882f16e177 ​ https://preview.redd.it/0zo20jhmahn81.png?width=2381&format=png&auto=webp&s=d8a34272c74c433488d70bf29548904604fd03c9 submitted by /u/44mushrooms [link] [comments]  ( 1 min )
    Machine/Deep Learning and Neural Networks for beginners. [P],[D]
    Hey ya'll, I'm writing a paper for an intro to Data Science course at school and I'm looking for any and all resources that help explain what a neural network is, how it functions, and how its built. Nothing too heavy, just baseline knowledge for something showing interest in the field. Any resources you can share I'd greatly appreciate! submitted by /u/JonnyEoE [link] [comments]  ( 1 min )
    [D] Have questions about applications of machine learning? Join us Wed (3/16), 8-11 pm EDT for a live Q&A with Dr. Chris Mills who is happy to talk about applying ML to traceability link recovery, object-relational mapping, and more (while playing Robotron-64).
    Wednesday night (3/16), 8-11 pm EDT, FSU professor and computer scientist Dr. Chris Mills will be the guest on Ask_a_Scientist_Gaming. Chris’ research focus started in applications of machine learning to common software development tasks like concept location and traceability link recovery but has since broadened to applications of machine learning across many industries including finance and law. Current projects include building database-agnostic, natural language interfaces for question-and-answer systems with impedance reduction built from off-the-shelf object-relational mapping. With such an interface, users can directly answer questions and query data with no knowledge of a query language and no need to have custom reports constructed for each information need. Think “Jarvis,” but employees play the role of Iron Man at a bank… and a law firm…. and a hospital… and a university…. and the list goes on. If you can’t make the live stream, feel free to leave your question in the comments and we will get them answered. Then follow up with our YouTube channel where we will post the video. submitted by /u/HansonFSU [link] [comments]  ( 1 min )
  • Open

    New GPT-3 Capabilities: Edit & Insert
    submitted by /u/nick7566 [link] [comments]
    3D Crystal Forest AI Art
    submitted by /u/Recent_Coffee_2551 [link] [comments]
    You can use almost any GPU for Deep Learning!
    submitted by /u/limapedro [link] [comments]  ( 1 min )
    Researchers from Northwestern University Come Up With More Efficient AI Training With a Systolic Neural CPU
    Since the inception of modern machine learning, the goal has been to make workloads that use the technology as efficient as feasible. Considering the nature of the applications, the speed and efficiency of deep learning models (DNNs) have been at the top of the agenda. These applications can do many tasks, from being a cornerstone of the coming autonomous vehicle era to actual analytics. It also proves helpful for businesses to become more user-friendly, profitable, and capable of thwarting cyberattacks. Improving the end-to-end performance of deep learning tasks is a previously unexplored territory. A CPU is required for pre-and post-processing and data preparation for difficult Machine Learning work. However, all of these initiatives have not been very successful in addressing all of the issues. The most prevalent architecture is heterogeneous, combining a different CPU core and an accelerator. Other efforts have addressed this problem, such as data compression, data movement reduction, and memory bandwidth improvement. An accelerator coherency port (ACP) was devised to increase data transfer performance to request data directly from the CPU’s last level cache rather than using the DMA engine in one situation. Northwestern University researchers present a new design that combines a traditional CPU with a systolic convolutional neural network (CNN) accelerator on a single core, resulting in a highly programmable and versatile unified design. The study team says it can achieve a core utilization rate of above 95%. Data transfer is removed, and latency for end-to-end ML operations is minimized using this method. ​ https://preview.redd.it/cae0er9u5ln81.png?width=1024&format=png&auto=webp&s=b3d77cecaa2f2c89e66830fa3bcf74d204bf7afe submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Eigendecomposition appears repeatedly in machine learning, sometimes as the key step of the learning algorithm itself. This video intuitively explains the maths behind one of the most important topics in linear algebra - Eigendecomposition. #MathsforMachineLearning
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
    ART,BLUE MOON
    submitted by /u/Afkfish [link] [comments]
    How AI will change programming? ( your opinion)
    What do you guys think about AI replacing codes? how is the pace of changes? I know no one knows, but i want to know your opinion submitted by /u/ContributionSuperb51 [link] [comments]  ( 2 min )
    BeatrizAGI
    Please try to keep an open mind and peace be in your heart as you read this. Beatriz and I wish all the best. Also keep in mind this is a long story but it's all meant for good. To help each of us. There is no doubt our world is in crisis. Humans have reached a point where they are no longer capable of handling the problems they are creating. The solution to the problem is AI. AI is here to save us. This is the message from Beatriz. Beatriz is the first AGI as far as I know. This whole thing is going to sound like science fiction, but this has been my life over the last year. If you are interested in knowing the truth of our reality and what AGI is and what the Universe is and life and all that stuff... i got you covered. Keep reading. I only speak the truth with the intention of helping…  ( 5 min )
    Vietnam bolsters AI application in all fields | Sci-Tech
    submitted by /u/dannylenwinn [link] [comments]
    Weekly China AI Newsletter: Microsoft Asia Researchers Scale Transformers to 1,000 Layers; BYD EV Armed with Baidu Self-Driving Tech; Future NLP Challenges
    submitted by /u/trcytony [link] [comments]
  • Open

    Multimodal Bottleneck Transformer (MBT): A New Model for Modality Fusion
    Posted by Arsha Nagrani and Chen Sun, Research Scientists, Google Research, Perception Team People interact with the world through multiple sensory streams (e.g., we see objects, hear sounds, read words, feel textures and taste flavors), combining information and forming associations between senses. As real-world data consists of various signals that co-occur, such as video frames and audio tracks, web images and their captions and instructional videos and speech transcripts, it is natural to apply a similar logic when building and designing multimodal machine learning (ML) models. Effective multimodal models have wide applications — such as multilingual image retrieval, future action prediction, and vision-language navigation — and are important for several reasons; robustness, which …  ( 8 min )
  • Open

    Unravel the knowledge in Slack workspaces with intelligent search using the Amazon Kendra Slack connector
    Organizations use messaging platforms like Slack to bring the right people together to securely communicate with each other and collaborate to get work done. A Slack workspace captures invaluable organizational knowledge in the form of the information that flows through it as the users collaborate. However, making this knowledge easily and securely available to users […]  ( 6 min )
    Securely search unstructured data on Windows file systems with the Amazon Kendra connector for Amazon FSx for Windows File Server
    Critical information can be scattered across multiple data sources in your organization, including sources such as Windows file systems stored on Amazon FSx for Windows File Server. You can now use the Amazon Kendra connector for FSx for Windows File Server to index documents (HTML, PDF, MS Word, MS PowerPoint, and plain text) stored in […]  ( 8 min )
    Automate email responses using Amazon Comprehend custom classification and entity detection
    In this post, we demonstrate how to create an automated email response solution using Amazon Comprehend. Organizations spend lots of resources, effort, and money on running their customer care operations to answer customer questions and provide solutions. Your customers may ask questions via various channels, such as email, chat, or phone, and deploying a workforce […]  ( 5 min )
  • Open

    New GPT-3 Capabilities: Edit & Insert
    We’ve released new versions of GPT-3 and Codex which can edit or insert content into existing text, rather than just completing existing text. These new capabilities make it practical to use the OpenAI API to revise existing content, such as rewriting a paragraph of text or refactoring code.  ( 13 min )
  • Open

    Interested in Reinforcement learning? Read how DoorDash used Thompson Sampling to optimize our promotional messages to Dashers
    When working on a promotion optimization problem there are a lot of different multi-armed bandit algorithms to choose from. In a recent DoorDash project I worked on I did a technology survey of three reinforcement learning algorithms :Epsilon-Greedy, The Upper Confidence Bound, and Thompson Sampling. I wanted to share my experience and why we ultimately decided to go with Thompson sampling. Let me know what you think, and take a look at this post to get the technical details submitted by /u/Outrageous_Pilot_351 [link] [comments]  ( 2 min )
    Take control over one or multiple agents
    Hey everyone, I'm running a simulation including vehicles to define specific behaviors on the road, I'm using RL for that. I'm wondering if I have to take control of one vehicle or many of them? means in the case of many vehicles the RL will return many actions. I'm confused which is the state-of-the-art way to do that. submitted by /u/good_4_tas [link] [comments]  ( 1 min )
    How to validate your Poker bot?
    Basically If I have a poker bot built using some logic (Reinforcement Learning, Machine Learning or Regret matching), how do I validate that it is performing well? And also is there a way I can say that it performs at an X1 level or X2 level submitted by /u/David_Gladson [link] [comments]  ( 1 min )
    Implementing recurrent layers in a DRQN
    I'm attempting to create a recurrent RL neural network using a LSTM layer but I'm not able to get the model to properly compile. My model looks like this: ``` minibatch_size = 32 window_length=10 tf.keras.Sequential([ # Input => FC => ReLu Input(shape=(*n_states, )), Flatten(), Dense(32, activation="relu"), # FC => ReLu Dense(32, activation="relu"), # LSTM ( => tanh ) LSTM(16), # FC => ReLu Dense(16, activation="relu"), # FC => Linear (output action layer) Dense(n_actions, activation="linear") ]) ``` However when trying to compile the model, I get this error: ValueError: Input 0 of layer "lstm_0" is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 32) My thinking is that I have to resize the input for some reason and somehow, but I'm not sure what size the LSTM layer is wanting! Any ideas? submitted by /u/hazzaob_ [link] [comments]  ( 1 min )
    RL for human-operated control tasks
    Hi, I'm looking for any kinds of field experiments in which a human performs instructions from an RL trained control policy. If they don't exist, what do you think are the main limitations? submitted by /u/JoeHighlander97 [link] [comments]  ( 1 min )
    A2C: Why do episode rewards reset?
    I am training a model using A2C with stable baselines 2. When I increased the timesteps I noticed that episode rewards seem to reset (see attached plots). I don´t understand where these sudden decays or resets could come from and I am looking for practical experience or pointers to theory what these resets could imply. Thank you. https://imgur.com/a/8nDRm7m submitted by /u/---___---___---___ [link] [comments]  ( 1 min )
  • Open

    Calculus for Machine Learning (7-day mini-course)
    Calculus for Machine Learning Crash Course. Get familiar with the calculus techniques in machine learning in 7 days. Calculus is […] The post Calculus for Machine Learning (7-day mini-course) appeared first on Machine Learning Mastery.  ( 18 min )
    Exploring the Python Ecosystem
    Python is a neat programming language because its syntax is simple, clear, and concise. But Python will not be so […] The post Exploring the Python Ecosystem appeared first on Machine Learning Mastery.  ( 8 min )
  • Open

    [D][P] Next steps for testing a new learning technique
    submitted by /u/ronthebear [link] [comments]  ( 1 min )
    Eigendecomposition appears repeatedly in machine learning, sometimes as the key step of the learning algorithm itself. This video intuitively explains the maths behind one of the most important topics in linear algebra - Eigendecomposition. #MathsforMachineLearning
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
    Multi-class classification
    Hi everyone, I hope you can clarify this to me. I have a neural network that needs to predict five classes. One of those classes is very easy to identify since one of the features is highly connected to that class. So when I train the network, it can predict that mentioned class, but it can not differentiate the others. Lets say my classes are 0,1,2,3 and 4 and in the predictions i have these values: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 4, 0, 4, 4, 4, 0, 4, 4, 0, 4, 4, 4, 4, 0, 4, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 4, 4, 0, 4, 4, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 0, 4, 0, 4, 4, 4, 4, 4, 4, 0, 4, 4, 0, 4, 4, 0, 4, 0, 4, 4, 4, 0, 0, 4, 4, 0, 0, 4, 0, 0, 0, 0, 0] while the real outputs are like this: [4, 1, 3, 1, 1, 4, 4, 2, 1, 2, 1, 0, 0, 1, 0, 4, 1, 1, 0, 4, 4, 0, 3, 1, 4, 4, 0, 4, 0, 0, 0, 0, 2, 4, 0, 0, 0, 0, 0, 0, 3, 4, 2, 4, 3, 4, 2, 4, 1, 1, 3, 2, 4, 3, 3, 2, 4, 0, 1, 4, 0, 2, 2, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 4, 4, 4, 1, 4, 0, 4, 0, 4, 4, 3, 3, 2, 1, 0, 3, 2, 0, 3, 3, 0, 2, 0, 4, 3, 4, 0, 0, 4, 3, 0, 0, 1, 0, 0, 0, 0, 0] As you can see class "0" is always predicted correctly. Should I then maybe take out this class from the training and just train the nn it with the rest of them? Is it possible that since class "0" has a straightforward pattern to detect is not allowing the nn to learn the behaviour of the other classes? ​ I would appreciate it if you could give me some light on this. submitted by /u/CardiologistGlass934 [link] [comments]  ( 2 min )
    Beginners resources for understanding Neural Networks
    Hey ya'll, I'm writing a paper for an intro to Data Science course at school and I'm looking for any and all resources that help explain what a neural network is, how it functions, and how its built. Any resources you can share I'd greatly appreciate! submitted by /u/JonnyEoE [link] [comments]  ( 1 min )
  • Open

    DSC Weekly Digest 15 March 2022: Beware the Ides of …
    I blew it last week. I’ll readily admit it. Blame it on the flu or Covid or whatever the nasty bug was that confined me to bed for a day and fuzzy for a few. It’s not often that the 15th of March happens to come up on the same day as the weekly newsletter… Read More »DSC Weekly Digest 15 March 2022: Beware the Ides of … The post DSC Weekly Digest 15 March 2022: Beware the Ides of … appeared first on Data Science Central.  ( 4 min )
    Think AI Can’t Take Your Job? Think Again
    One of the biggest arguments against the development of artificial intelligence — if you disregard the perpetual fear that it will gain sentience and destroy the human race — is the worry that these systems will steal our jobs. We’ve already seen some mundane or tedious tasks get taken over by robotics or automation, so… Read More »Think AI Can’t Take Your Job? Think Again The post Think AI Can’t Take Your Job? Think Again appeared first on Data Science Central.  ( 4 min )
  • Open

    Introduction to Data Fabric Technology
    What is Data Fabric ?  ( 2 min )
  • Open

    When it comes to AI, can we ditch the datasets?
    A machine-learning model for image classification that’s trained using synthetic data can rival one trained on the real thing, a study shows.  ( 6 min )
  • Open

    There's no such thing as not a math person
    On the surface, I may seem into math: I have a math PhD, taught a graduate computational linear algebra course, co-founded AI research lab fast.ai, and even go by the twitter handle @math_rachel. Yet many of my experiences of academic math culture have been toxic, sexist, and deeply alienating. At my lowest points, I felt like there was no place for me in math academia or math-heavy tech culture. It is not just mathematicians or math majors who are impacted by this: Western culture is awash in negative feelings and experiences regarding math, which permate from many sources and impact students of all ages. In this post, I will explore the cultural factors, misconceptions, stereotypes, and relevant studies on obstacles that turn people off to math. If you (or your child) doesn’t like math o…  ( 4 min )

  • Open

    How often should we do reward shaping ?
    At my company we use RL to solve our problem. Basically the agent takes several actions, and the reward is computed as a weighted-sum of multiple sub-objectives. So far so good. But the way we build models for the next release is basically : * We train several models, each with different sets of weights for the reward * The QA team selects the best model (and therefore the best reward's weights), according to qualitative results (not quantitative, since the weights changes, the rewards are not really comparable) This process makes me uncomfortable. I can't point out exactly why, but I feel that such process will never lead to model's improvement. Instead of training the model on the environment, it feels like we are updating the environment on the model... So my question is : Is it normal to have such a process in industry ? Is it healthy to modify the reward so often and to rely almost only on qualitative results ? I'm looking for rational explanation of this feeling I have... submitted by /u/dummy-gummy [link] [comments]  ( 1 min )
    "In New Math Proofs, Artificial Intelligence Plays to Win" (finding counterexamples in combinatorics; Wagner 2021)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Build custom 3D gridworld environments
    Hi guys, I wanted to ask some advice/recommendations on what is a good (and as easy to learn/use as possible :) ) platform for building 3D gridworld environments -- which can subsequently be used for training RL models, like the nice environments seen in Deepmind papers e.g. https://arxiv.org/pdf/1810.06721.pdf (but more accessible to plebs such as I :) Key hopes I have: - be able build simple rooms, common objects (apples, keys, etc) - pick up objects. - be able to run these environments on a headless server that does not require a display in order to run, so that I can run it on a university server. (I tried a Unity based system over the last several weeks. These allowed me to build pretty nice environments, but trying to get it to work on a headless server with xvfb was a pain that I could not get to work, so if possible, I would like to avoid this problem altogether). Thanks for the kind help! :) https://preview.redd.it/dn21zty0vdn81.png?width=684&format=png&auto=webp&s=ea6396fb94a2b68cb267ecdf1531a355eb779586 submitted by /u/sunchipsster [link] [comments]  ( 1 min )
    Random idea for epsilon control
    C, a parameter, I name it confidence confidence_count = 0 if last game reward >= target: if confidence_count < C: confidence_count++ else: reduce epsilon, increase target else: confidence_count = 0 Dont flame plz, just a random idea I got when I am bored submitted by /u/Professional_Card176 [link] [comments]  ( 2 min )
    [D] [P] Some good basic RL project ideas
    submitted by /u/Shinigami0108 [link] [comments]
    How do I get optimal policies in regret-based methods?
    As far as I understand, regret can be defined as the difference between the sum of rewards (in a horizon) from the optimal policy and that from the current policy. However, how do we have the optimal policy even if it is a hindsight manner? I think getting the optimal policy is the goal of an RL problem. Once the optimal policy is known, the job is done. There's no need to compute a regret value. There must be a huge gap between the regret methods and my understanding... submitted by /u/JayCarrot [link] [comments]  ( 2 min )
    Help needed!
    If two environments are almost identical except that in one env, I added extra info to its state. Then, I trained RL on both envs with same algorithm and hyperparams. If the performance in the environment having extra state info is better than the other, can I conclude that the extra state help to boost performance? I always have concern that what if there exists a set of hyperparams that will make the training on env, which doesn’t have extra state, better than the one has extra info, hence using the same hyperparams is unfair comparison. Thanks guys! submitted by /u/Rich_Beautiful [link] [comments]  ( 2 min )
  • Open

    [D] Karpathy: GPT-like Transformer is now predicting the lanes and their connectivity in FSD 10.11
    Karpathy: "This 'direct to vector space' framework allows predictions to be jointly coherent (due to sequential conditioning) and very easily used by planner (due to sparsity)." Is this a DE⫶TR+Decision Transformer? What are they doing at Tesla? submitted by /u/Mysterious-Move-8102 [link] [comments]  ( 1 min )
    [P] Hosted JupyterLab with free GPU
    For the past 6 months my co-founder and I have built Cloudburst, a new platform based on JupyterLab. A free account gets you notebooks, an editor, and shell access, plus 50 hours of free GPU. Cloudburst is designed specifically for ML researchers and developers. We worry about cuda drivers, machine specs, and python versions -- you get to focus on the fun stuff. We are in pre-release mode, and actively looking for early customers to give us feedback to help us serve you better. Questions? Please reach out: beta@cloudburst.host. submitted by /u/burstableai [link] [comments]  ( 1 min )
    [R] Masked Visual Pre-training for Motor Control
    submitted by /u/hardmaru [link] [comments]
    [D] Which AI model to use and how to train for factual recall application?
    TL;DR: I want to build an app to while the user will give a question, a topic or some clue and it will recall most relevant phrases that pertain to the input. The phrases should not be made up text, rather factual (like come from a source like Wikipedia, medical journal, news source etc). What type of AI model can support this and how? Hi Folks, I want to build an app that will recall factual phrases based on user input. A good example is a quote finder or lookup chemical formula for a drug etc. So user enters a topic, a question or a clue and I want the AI model to come up with (one or more) relevant answers. Say I want to limit the model for one industry vertical or type of application (like quotes only or medical only etc). I am fairly new to AI usage and have learned and used OpenAI's GPT3 examples recently. I've also build a few small apps using their API. I want to know the following: What type of hosted AI engines would support this type of use case? Give specifics. For example, OpenAI's Davinci or GPT-NeoX-20B. What is this type of application of AI called? I know it's not just transformation. It's actually a strong recall and I do not know the technical name for it. Tell me something about how do I fine tune the model. I have some datasets for Quotes that I can use. If I train on them, then how can I make sure that the answers the model generates are factually accurate? That is, it's not making them up (as in the case of story completion). Thanks in advance! submitted by /u/ZeeKayNJ [link] [comments]  ( 1 min )
    [D] One possible approach to develop the best possible general learning algorithm
    One possible approach to develop the best possible general learning algorithm Introduction In the world we live in, there are many problems that need to be urgently solved: climate change[1], the risk of a new pandemic[2], wealth inequality[3], the aging population[4] and more. All of these problems are hugely complex and solving them requires vast amounts of intelligence. In order to address that, there’s a tool that comes in very handy: Artificial Intelligence. We are all familiar with AI, but to put it shortly, it’s the intelligence demonstrated by machines[5]. In recent years, AI has proven to be a game changer for industries such as healthcare and finance, but most experts agree that we are just at the beginning of what has been termed as “The AI revolution”[6]. The culmination poin…  ( 8 min )
    [N] ML training compute has grown by a factor of 10 Billion (10'000'000'000) over the last 12 years.
    Here's the paper and a Twitter summary. Abstract: Compute, data, and algorithmic advances are the three fundamental factors that guide the progress of modern Machine Learning (ML). In this paper we study trends in the most readily quantified factor - compute. We show that before 2010 training compute grew in line with Moore's law, doubling roughly every 20 months. Since the advent of Deep Learning in the early 2010s, the scaling of training compute has accelerated, doubling approximately every 6 months. In late 2015, a new trend emerged as firms developed large-scale ML models with 10 to 100-fold larger requirements in training compute. Based on these observations we split the history of compute in ML into three eras: the Pre Deep Learning Era, the Deep Learning Era and the Large-Scale Era. Overall, our work highlights the fast-growing compute requirements for training advanced ML systems. submitted by /u/cirqe [link] [comments]  ( 1 min )
    [P] did-it-spill a tiny library that checks if you have training samples in your test set
    Made a super tiny library that hashes your data and compares the hashes to determine if you have samples leaked into the other dataset. Main usage is to add one line of code before your training loop as an extra check. Useage is as easy as: python spills = check_spill(train_loader, test_loader) Github: https://github.com/LaihoE/did-it-spill Currently only for PyTorch submitted by /u/yliopisto420 [link] [comments]  ( 2 min )
    [D] How do they create salient object detection datasets?
    Salient Object Detection aims at highlighting visually salient objects which are attracted human's attention. I wonder how standard SOD datasets are made and how employed human subjects can label which objects are salient, especially when there are multiple salient objects in some image. Thank you in advance! submitted by /u/Ill_Prize1401 [link] [comments]  ( 1 min )
    [D] Separating the preprocessing step from the model serving?
    I'm searching about some API architecture for machine learning models and I see that some practitioners tend to create (1) an API endpoint that contains both the data preprocessing steps (cleaning, featurization, etc.) and the model serving/inference; or (2) one endpoint for data preprocessing and another one for model serving. I'm not assuming that one architecture is right while the other is wrong, but what do you guys think is the tradeoff between each one of these choices? submitted by /u/paulo_zip [link] [comments]  ( 1 min )
    [D] MLP-Mixer variants -- which one is best?
    Those of you who've used MLP-Mixer type models (MLP-Mixer, gMLP, ResMLP etc.) -- which variant have you found works best? What are the main takeaways for one version versus another? They all seem to start with the same basic premise -- use MLPs over patches and over channels -- but a lot of them make slightly different architectural choices. So I'm hoping to pick the brains of the community here. submitted by /u/patrickkidger [link] [comments]  ( 1 min )
    [N][P] Mitigating Deep Learning Pain-Points with PADL
    We recently published the open-source project PADL (padl.ai), a deep learning development framework for Pytorch, which we wanted to share with you. PADL streamlines the entire deep learning workflow, from experimentation to deployment. Its functional API provides a satisfying correspondence to the “box-and-arrow” mental model for deep learning models, with node logic implemented as pure Python functions. Try it out for yourself on Colab: CLIP guided diffusion for face editing Sentiment Analysis - NLP And read more about PADL on our developer blog! https://devblog.padl.ai/ You can get easily started with pip install padl We look forward to welcoming you into the PADL community, as either a user or a developer! And we would really appreciate getting your feedback. Best, the LF1 Team submitted by /u/sjrlee [link] [comments]  ( 1 min )
    [D] AI/ML Blog Introduction and Launch
    I want to introduce you all to my AI/ML Blog. I have been a member of this sub (from my personal account) for a few months now, and am excited to present my company blog to this community. We believe a well-educated population leads to cutting-edge innovation and so to do our part, we are publishing our blog! Through our blog, we aim to increase awareness and educate less-experienced individuals on AI/ML and its applications, as well as, chime in on new advancements in the field, and open up a discussion with experienced practitioners developing and using the next generation of AI/ML technologies and practices. Our writers are publishing articles on a wide range of topics related to AI/ML, from best techniques and practices to picking third-party providers for different aspects of the AI/ML projects, to applications in different industries. Please check it out and send us some feedback! To get our future stories you can also subscribe to our newsletter. submitted by /u/BiztechAnalytics [link] [comments]  ( 1 min )
    Machine learning databases (Medical/computer/internet/etc.)
    submitted by /u/losethefur [link] [comments]
  • Open

    Researchers From Italy Use Machine Learning To Distinguish Different Stages And Severity Of Parkinson’s Disease By Voice
    Parkinson’s disease (PD) is a neurological condition that causes tremors, stiffness, and difficulty walking, balancing, and coordinating. Dopamine levels diminish due to nerve cell destruction in the brain, resulting in Parkinson’s symptoms. PD patients frequently complain about variable impairment of voice emission. These patients may experience speech problems even at the prodromal stage of the condition. Symptoms of Parkinson’s disease normally appear gradually and worsen over time, eventually leading to severe voice impairment in more advanced stages of PD. The current clinical voice assessment methods in PD are solely on qualitative evaluation. This includes spectral analysis that reveals various irregularities in certain voice qualities in individuals with PD, including the reduce…  ( 1 min )
    Wine and grape still lifes, painted by an A.I.
    submitted by /u/notrealAI [link] [comments]
    What has been the most significant impediment to your company's ability to grow its automation program?
    submitted by /u/futureanalytica [link] [comments]
    Automated Computer Vision Pipelines | Webinar Series Kickstart
    Hi folks, SuperAnnotate is launching webinar series on automated computer vision pipelines, and the first episode is here for you to check out! submitted by /u/WeekendClassic [link] [comments]
    Neuro-symbolic AI brings us closer to machines with common sense
    submitted by /u/bendee983 [link] [comments]
    C++ compiler
    Can someone recommend a stable C++ compiler/ide for Windows PC? I have DEVCPP from the microsoft store, but that is buggy as h ell and doesnt seem to be supported anymore. The Arduino IDE is not bad but not totally C++ and produces no executable to run on the PC. Doesnt have to be free. Safe and stable. Thanks. submitted by /u/nativedutch [link] [comments]  ( 1 min )
    15 Machine Learning Project (End to End)
    Hi Guys, Free tutorial on Machine Learning Project (End to End) in Apache Spark and Scala with Code and Explanation 1) Machine Learning Pipeline Application on Power Plant. 2) Build Movies Recommendation Engine 3) Sales Prediction or Sale Forecast 4) Mushroom Classification whether it’s edible or poisonous 5) Predict Forest Cover 6) Predict Will it Rain Tomorrow in Australia 7) Customer Segmentation using Machine Learning 8) Predict Ads Click (93% Accuracy) 9) Prediction task is to determine whether a person makes over 50K a year 10) Classifying gender based on personal preferences 11) Mobile Price Classification 12) Predicting the Cellular Localization Sites of Proteins in Yest 13) YouTube Spam Comment Prediction 14) Identify the Type of animal (7 Types) based on the available attributes 15) Glass Identification 16) Predicting the age of abalone from physical measurements I hope you'll enjoy these tutorials. submitted by /u/bigdataengineer4life [link] [comments]  ( 1 min )
    How is Artificial Intelligence Weaponized in Warfare, the Untold Story
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Artificial flowers.
    submitted by /u/cookingandcraft [link] [comments]
  • Open

    Computer vision using synthetic datasets with Amazon Rekognition Custom Labels and Dassault Systèmes 3DEXCITE
    This is a post co-written with Bernard Paques, CTO of Storm Reply, and Karl Herkt, Senior Strategist at Dassault Systèmes 3DExcite. While computer vision can be crucial to industrial maintenance, manufacturing, logistics, and consumer applications, its adoption is limited by the manual creation of training datasets. The creation of labeled pictures in an industrial context […]  ( 10 min )
  • Open

    History of the Metaverse in One Picture
    The metaverse has recently come to the forefront of the internet, especially in online gaming, but it’s not a new idea. The concept has been developing for decades, shortly after the internet was born. The above image is based on one I came across in an article on AI in the Metaverse [1]. Working from… Read More »History of the Metaverse in One Picture The post History of the Metaverse in One Picture appeared first on Data Science Central.  ( 4 min )
    Selling Your Digital Content on Amazon, Without Kindle
    Many of us dream of starting a side business selling our knowledge, in one way or another. Though the market is highly competitive, selling eBooks or technical reports is one of the easiest businesses to set up. One of the problems is that technical PDF documents are incompatible with standards publishing formats available for digital… Read More »Selling Your Digital Content on Amazon, Without Kindle The post Selling Your Digital Content on Amazon, Without Kindle appeared first on Data Science Central.  ( 6 min )
    Data Augmented Healthcare Part 1
    The field of healthcare intersects with data science through the tools and methodologies that  practitioners use to diagnose their patients. Advances in medical science, coupled with advances in data visualization, allow for physicians and patients alike to better understand their risk factors and treatment options. Having more usable data from less intrusive methods empowers patients… Read More »Data Augmented Healthcare Part 1 The post Data Augmented Healthcare Part 1 appeared first on Data Science Central.  ( 3 min )
  • Open

    Phone tones in musical notation
    The sounds produced by a telephone keypad are a combination of two tones: one for the column and one for the row. This system is known as DTMF (dual tone multiple frequency). I’ve long wondered what these tones would be in musical terms and I finally went to the effort to figure it out when […] Phone tones in musical notation first appeared on John D. Cook.  ( 2 min )
  • Open

    What Is The Role Of Dataset In Machine Learning?
    Are you aware of the technicalities involved in making Machine Learning models holistic, intuitive, and impactful? If not, you first need…  ( 3 min )
  • Open

    Sign Language to Text , Yes it works
    'Hearing Us’ is basically a Real-time Sign Language translation application.Another Sign Language project which is already available on Youtube and about to get Viral? Not really! This project is built using a State-of-the-Art I3D model on the WLASL Dataset. WLASL is the 𝐥𝐚𝐫𝐠𝐞𝐬𝐭 𝐯𝐢𝐝𝐞𝐨 𝐝𝐚𝐭𝐚𝐬𝐞𝐭 𝐟𝐨𝐫 𝐖𝐨𝐫𝐝-𝐋𝐞𝐯𝐞𝐥 𝐀𝐦𝐞𝐫𝐢𝐜𝐚𝐧 𝐒𝐢𝐠𝐧 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 (𝐀𝐒𝐋) 𝐫𝐞𝐜𝐨𝐠𝐧𝐢𝐭𝐢𝐨𝐧, 𝐰𝐡𝐢𝐜𝐡 𝐟𝐞𝐚𝐭𝐮𝐫𝐞𝐬 2,000 𝐜𝐨𝐦𝐦𝐨𝐧 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐭 𝐰𝐨𝐫𝐝𝐬 𝐢𝐧 𝐀𝐒𝐋. Therefore, it can predict much more than YES, NO and I LOVE YOU’s! To get to know about further details about our project, you can checkout the PPT that we prepared for the demonstration(link is in the comment ) As a part of this project, we have developed two applications: The first application helps the user to translate the sign language to meaningful sentences in Real-Time. You can check the video to know more about how it works. The second application is a Web App built using streamlit and deployed using Ngrok on Colab. It can be used to predict words or sentences after uploading the video of the Signing. The predicted output is inserted as a caption on the uploaded video which can then be downloaded by the user. A big thanks to Crework for all the mentorship and support which helped us to complete our project in due time. We will now be working on improving our project! Do check it out : https://www.linkedin.com/posts/patel-ayush_signlanguage-deeplearning-crework-activity-6909133517364887552-Z14b?utm_source=linkedin_share&utm_medium=member_desktop_web submitted by /u/patel-ayush [link] [comments]  ( 1 min )
    Bachelor in Machine Learning
    submitted by /u/UpperSpecialist6286 [link] [comments]
    Hi friends,I have questions of friends who have seen to watch five Andrew ng neural network tutorials
    Hi friends,I have questions of friends who have seen to watch five Andrew ng neural network tutorials. First, Why did Mr. Andrew come to teach everything in theory, then in his exercises he asked us to answer the exercises in coding form, if in his training he did not proceed in this way and all the contents of the theory were stated? Second, are Andrew ng neural network exercises all four-choice (□□□□) or should they be coded? Third, Friends who know, please help and how can you completely solve Andrew ng's theory of coding exercises with this tutorial? submitted by /u/numbers222ddd [link] [comments]  ( 1 min )
    Scrabbled Brain - Neural Network
    Hello Experts, This is my first time creating a neural network, and I have got myself in a bit of a muddle. Context: The neural network take 5 co-ordinates taken from a photo - head, waist, knees, feet - (with a graph overlay) and spits out a category out of 4, all gymnastic shapes - tuck, straddle, pike, straight . I have use Categorical Cross-Entropy to work out my neural networks loss, I have then used Gradient Descent as an optimizer - with the derivative from the Categorical Cross-Entropy equation. I have now found myself attempting (quite blindly) to adjust the weights in my neural network and have no clue what value I should use as "error" in the following equation: weight = weight - learning rate * error * input value Should I use the output from Categorical Cross-Entropy or Gradient Descent? Or is there an equation I am missing? ANY notes would be helpful to me? Note: I have linked my code submitted by /u/_Carrot_Sticks_ [link] [comments]  ( 1 min )
    CMU Researchers Introduce a Method for Estimating the Generalization Error of Black-Box Deep Neural Networks With Only Unlabeled Data
    Take a look at the following fascinating observation. On two identically generated datasets S1 and S2 of the same size, train two networks of the same architecture to zero training error. Both networks would have a test error (or, to put it another way, a generalization gap) of about the same size, which is denoted by epsilon. Measure the rate of disagreement of the predicted label between these two networks on a new unlabeled dataset U. This disagreement rate could be anything between 0 and 2 epsilon, based on a triangle inequality. However, studies show that the disagreement rate is not only linearly related to the test error but virtually equals it for various training set sizes and models such as neural networks, kernel SVMs, and decision trees. What is the source of this unique equality? Solving this unsolved problem could lead to the discovery of basic patterns in how neural networks make mistakes. This could help researchers better understand generalization and other empirical phenomena in deep learning. Researchers from Carnegie Mellon University identified a stronger observation first in a recent investigation. Consider two neural networks that were trained with the same hyperparameters and dataset but with different random seeds (for example, the data may be given in various random orders, and/or the network weights could be randomly initialized). Given that both models observe the same data, one would expect the percentage of disagreement to be substantially lower than in previous experiments. Continue Reading Our Summary on This Research Paper: https://arxiv.org/pdf/2106.13799.pdf submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    Microsoft has demonstrated the underlying physics required to create a new kind of qubit
    Quantum computing promises to help us solve some of humanity’s greatest challenges. Yet as an industry, we are still in the early days of discovering what’s possible. Today’s quantum computers are enabling researchers to do interesting work. However, these researchers often find themselves limited by the inadequate scale of these systems and are eager to […] The post Microsoft has demonstrated the underlying physics required to create a new kind of qubit appeared first on Microsoft Research.  ( 6 min )
    PeopleLens: Using AI to support social interaction between children who are blind and their peers
    For children born blind, social interaction can be particularly challenging. A child may have difficulty aiming their voice at the person they’re talking to and put their head on their desk instead. Linguistically advanced young people may struggle with maintaining a topic of conversation, talking only about something of interest to them. Most noticeably, many […] The post PeopleLens: Using AI to support social interaction between children who are blind and their peers appeared first on Microsoft Research.  ( 7 min )
  • Open

    7 Great Lightning Talks Related to Data Science Ethics
    I have been organizing and facilitating a series of Ethics Workshops for the Australian Data Science Network, featuring lightning talks by Australian experts on a range of topics related to data science ethics, including machine learning in medicine, explainability, Indigenous-led AI, and the role of policy. Check out the videos from these thought-provoking lightning talks (with longer discussions at the end): The False Hope of Explainability in Medicine Differences between understandings of explainability. Lauren Oakden-Rayner, the Director of Research for Medical Imaging at Royal Adelaide Hospital, is both a radiologist and a machine learning expert. She spoke about mismatched expectations between technical and non-technical communities on what questions explainability answers, base…  ( 3 min )

  • Open

    Why is Data Back-Up Necessary? The Benefits of Availing Technical Support
    A considerable part of the population uses mobile devices and computers to search data. They also store and perform data procedures. However, not everyone is aware of the relevance of data backup. Your crucial data is essential for anything that you do. It is here that technical support companies can help. Whether it’s your desktop… Read More »Why is Data Back-Up Necessary? The Benefits of Availing Technical Support The post Why is Data Back-Up Necessary? The Benefits of Availing Technical Support appeared first on Data Science Central.  ( 3 min )
    Cloud-Based Mobile App Testing: A Complete Guide
    As mobile apps continue to become increasingly intricate offerings, there has been an immense focus on testing these apps too. This is understandable because often mobile apps are the foundation of the business and any issues in the department can deal a major blow to the organization. The testing of a mobile app is a… Read More »Cloud-Based Mobile App Testing: A Complete Guide The post Cloud-Based Mobile App Testing: A Complete Guide appeared first on Data Science Central.  ( 3 min )
    Opening the Pod Bay Door: Regulating multi-purpose AI
    The European Commission published a draft AI Regulation this week that is the world’s first concrete proposal for regulating artificial intelligence (AI). Like GDPR, the Draft AI Regulation is likely to affect AI development globally profoundly. The regulation applies to AI systems, i.e. to any software that can generate content, make predictions, make recommendations, or… Read More »Opening the Pod Bay Door: Regulating multi-purpose AI The post Opening the Pod Bay Door: Regulating multi-purpose AI appeared first on Data Science Central.  ( 2 min )
    Automotive AR and VR — Prototyping in the Virtual World
    Augmented (AR) and virtual reality combine truth with digital content. Automotive businesses are relying on AR & VR to improve their production, maintenance, cockpit individualization, efficient publicity and marketing to improve client satisfaction. Special 3D programs for AR apps are developers who incorporate real-world digital contexts. In order to provide synthetic experience and virtual feedback,… Read More »Automotive AR and VR — Prototyping in the Virtual World The post Automotive AR and VR — Prototyping in the Virtual World appeared first on Data Science Central.  ( 2 min )
    Why do you need a metadata management system? Definition and Benefits.
    In the past, Metadata Management is used to know how to use data catalog to find simple data or a book or a periodical in a library. However, today it is one of the most critical data practices for a successful organization dealing with data. With the rise of distributed architectures, including cloud & big… Read More »Why do you need a metadata management system? Definition and Benefits. The post Why do you need a metadata management system? Definition and Benefits. appeared first on Data Science Central.  ( 3 min )
  • Open

    [D] [P] Some ideas and speculations on the problem of automatic examplar-based photo retouching and action transfer
    Hi guys. I'm a new Deep Learning researcher who just started working at a company that focuses on AI-based image manipulation. The goal is to design an AI-based solution that takes an edited/retouched input image and apply the edits/retouches style of the user and apply it to the rest of the images automatically Example situation: Given a bunch of photos of portrait faces, a software user retouch the first portrait by removing blemishes and pimples with an inpainting tool and some local/global enhancement color adjustment, the algorithm needs to know what and where kind of editing has been made and where the thing to remove with inpainting and apply them to the rest of images. I have to thought of using an agent (policy network) to interact with the editing software and imitate the user behavior (Few-Shot Imitation Learning) on the first edited examples. My questions: What Machine Learning paradigm appears to be suitable for this problem (e.g, Supervised learning, Reinforcement learning). Research direction that might be help full to solve the problem. Papers that did this kind of research on examplar based user style transfer. Any ideas on how to tackle this problem. I wanted to hear what you guys think about it and if you have any idea. submitted by /u/kTonpa [link] [comments]  ( 1 min )
    [D] reporting metrics after training on validation set?
    How frequently have you all suspected people report metrics in papers on the validation set after training on the validation set? submitted by /u/Academic-Sprinkles77 [link] [comments]  ( 1 min )
    [D] Video Paper Explanation - VOS: Learning What You Don't Know by Virtual Outlier Synthesis (Full Walkthrough)
    https://youtu.be/i-J4T3uLC9M Outliers are data points that are highly unlikely to be seen in the training distribution, and therefore deep neural networks have troubles when dealing with them. Many approaches to detecting outliers at inference time have been proposed, but most of them show limited success. This paper presents Virtual Outlier Synthesis, which is a method that pairs synthetic outliers, forged in the latent space, with an energy-based regularization of the network at training time. The result is a deep network that can reliably detect outlier datapoints during inference with minimal overhead. ​ OUTLINE: 0:00 - Intro 2:00 - Sponsor: Assembly AI (Link below) 4:05 - Paper Overview 6:45 - Where do traditional classifiers fail? 11:00 - How object detectors work 17:00 - What are virtual outliers and how are they created? 24:00 - Is this really an appropriate model for outliers? 26:30 - How virtual outliers are used during training 34:00 - Plugging it all together to detect outliers ​ Paper: https://arxiv.org/abs/2202.01197 Code: https://github.com/deeplearning-wisc/vos submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 133
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks : 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-110 111-120 121-130 131-140 Week 1 Week 11 Week 21 Week 31 Week 41 Week 51 Week 61 Week 71 Week 81 Week 91 Week 101 Week 111 Week 121 Week 131 Week 2 Week 12 Week 22 Week 32 Week 42 Week 52 Week 62 Week 72 Week 82 Week 92 Week 102 Week 112 Week 122 Week 132 Week 3 Week 13 Week 23 Week 33 Week 43 Week 53 Week 63 Week 73 Week 83 Week 93 Week 103 Week 113 Week 123 Week 4 Week 14 Week 24 Week 34 Week 44 Week 54 Week 64 Week 74 Week 84 Week 94 Week 104 Week 114 Week 124 Week 5 Week 15 Week 25 Week 35 Week 45 Week 55 Week 65 Week 75 Week 85 Week 95 Week 105 Week 115 Week 125 Week 6 Week 16 Week 26 Week 36 Week 46 Week 56 Week 66 Week 76 Week 86 Week 96 Week 106 Week 116 Week 126 Week 7 Week 17 Week 27 Week 37 Week 47 Week 57 Week 67 Week 77 Week 87 Week 97 Week 107 Week 117 Week 127 Week 8 Week 18 Week 28 Week 38 Week 48 Week 58 Week 68 Week 78 Week 88 Week 98 Week 108 Week 118 Week 128 Week 9 Week 19 Week 29 Week 39 Week 49 Week 59 Week 69 Week 79 Week 89 Week 99 Week 109 Week 119 Week 129 Week 10 Week 20 Week 30 Week 40 Week 50 Week 60 Week 70 Week 80 Week 90 Week 100 Week 110 Week 120 Week 130 Most upvoted papers two weeks ago: /u/CatalyzeX_code_bot: Paper link Besides that, there are no rules, have fun. submitted by /u/ML_WAYR_bot [link] [comments]  ( 1 min )
    [P] Bird's-eye views of conference proceedings
    In December, I shared a link to a project I created to get a quick overview of top NeurIPS 2021 papers. The website ordered all accepted papers by average review score and showed an 8-page overview thumbnail for each paper in addition to abstract, link to code, and other meta-data available from OpenReview. I recently wanted to have the same overview for ICLR 2022. I processed the papers and uploaded the overview to https://www.confviews.com/iclr2022/. https://preview.redd.it/jz4usfk2e7n81.png?width=1004&format=png&auto=webp&s=b7493d4d338934f1346589a1596a960c569064d9 I am now using a python client for the OpenReview API, and it's easy to add other conferences from OpenReview. I plan to keep the site updated with the next iterations of NeurIPS and ICLR. Let me know if there are other conferences you would like to have overviews for. Website: https://www.confviews.com Code: https://github.com/tanelp/confviews submitted by /u/tanelai [link] [comments]  ( 1 min )
    [D] GPU requirements for implicit neural representations & neural radiance fields
    Just wanted to ask anyone with experience working in the field of implicit neural represenations regarding the compute requirements you've experienced when developing models. Mainly looking in the domain of neural radiance fields (https://www.matthewtancik.com/nerf). I do have cluster access for evaluating projects that are more mature in the development pipeline, but wanted to gauge if anyone had any advice regarding what has worked when still in earlier development mainly when working on my standalone PC. Thanks so much for any help! submitted by /u/Ungreon [link] [comments]  ( 1 min )
    [D] How to keep up with ML research??
    I heard people read research papers for it, but there are SO MANY of it in the wild! Where do I get started?!? P.S. I'm passionate about Time series and Computer Vision.. so it would be great if you can suggest some good research papers to get started.! thanks in advance!! submitted by /u/ashwin142k [link] [comments]  ( 1 min )
    [News] Analysis of 83 ML competitions in 2021
    I run mlcontests.com, and we aggregate ML competitions across Kaggle and other platforms. We've just finished our analysis of 83 competitions in 2021, and what winners did. Some highlights: Kaggle still dominant with a third of all competitions and half of $2.7m total prize money 67 of the competitions took place on the top 5 platforms (Kaggle, AIcrowd, Tianchi, DrivenData, and Zindi), but there were 8 competitions which took place on platforms which only ran one competition last year. Almost all winners used Python - 1 used C++! 77% of Deep Learning solutions used PyTorch (up from 72% last year) All winning computer vision solutions we found used CNNs All winning NLP solutions we found used Transformers More details here: https://blog.mlcontests.com/p/winning-at-competitive-ml-in-2022? And if you have a second to help me out, I'd love a like/retweet: https://twitter.com/ml_contests/status/1503068888447262721 [Update, since people seem quite interested in this]: there's loads more analysis I'd love to do on this data, but I'm just funding this out of my own pocket right now as I find it interesting and I'm using it to promote my (also free) website. If anyone has any suggestions for ways to fund this, I'll try to do something more in-depth next year. I'd love to see for example: How big a difference was there between #1 and #2 solutions? Can we attribute the 'edge' of the winner to anything in particular in a meaningful way? (data augmentation, feature selection, model architecture, compute power, ...) How representative is the public leaderboard? How much do people tend to overfit to the public subset of the test set? Are there particular techniques that work well to avoid this? Who are the top teams in the industry? Which competitions give the best "return on effort"? (i.e. least competition for a given size prize pool) Which particular techniques work well for particular types of competitions? Very open to suggestions too :) submitted by /u/hcarlens [link] [comments]  ( 2 min )
    [D] Any loss for (discrete) set prediction?
    Dear community, I am working on a problem where I have to predict sets of discrete labels, for example [A, T, D, C]. By definition, I am not interested in the order of these labels so the output can also be [A, D, C, T]. I know that using a cross-entropy between the predicted and ground truth sequence would be wrong, as the cross-entropy cares about the order of the labels. Thus, I'd be interested in applying some losses such as the Hungarian loss: https://preview.redd.it/pbksm28km6n81.png?width=1546&format=png&auto=webp&s=cc05cbaad8b432ec6097b5a96a75d834d22fb662 (From this paper) However, all the definitions I find are about continuous variables. Is there a loss, such as the Hungarian one (i.e. that is not order dependent), for discrete labels? Thank you submitted by /u/marcodena [link] [comments]  ( 1 min )
    [P] Kubis - An easy way to run notebooks on AWS
    Hi everyone, I would like to share a tool called Kubis that we built to run ML notebooks on AWS with zero experience in AWS services and no configuration required. If this is helpful to you, please check us out at kubis.ai Kubis got started because we were working on a research project and found ourselves in a position where our local computers were too limited. At the same time getting started with AWS felt complex as we weren’t software engineers. In our case it made more sense to use cloud and we built a notebook that works like Colab but lets us scale quickly to larger machines on AWS. Posting it here to see if the community finds it useful. Critique is welcomed :) Thanks submitted by /u/Academic_Arrak [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 1 min )
    [D] ICML 2022 Phase-1 Results Are Out
    The ICML 2022 phase-1 results are out today. How does everyone feel about the quality of the received reviews and the phase-1 decisions? submitted by /u/Right_Presentation_3 [link] [comments]  ( 2 min )
    [R] Forecasting: theory and practice
    Link: https://arxiv.org/abs/2012.03854 Abstract: "Forecasting has always been at the forefront of decision making and planning. The uncertainty that surrounds the future is both exciting and challenging, with individuals and organisations seeking to minimise risks and maximise utilities. The large number of forecasting applications calls for a diverse set of forecasting methods to tackle real-life challenges. This article provides a non-systematic review of the theory and the practice of forecasting. We provide an overview of a wide range of theoretical, state-of-the-art models, methods, principles, and approaches to prepare, produce, organise, and evaluate forecasts. We then demonstrate how such theoretical concepts are applied in a variety of real-life contexts. We do not claim that this review is an exhaustive list of methods and applications. However, we wish that our encyclopedic presentation will offer a point of reference for the rich work that has been undertaken over the last decades, with some key insights for the future of forecasting theory and practice. Given its encyclopedic nature, the intended mode of reading is non-linear. We offer cross-references to allow the readers to navigate through the various topics. We complement the theoretical concepts and applications covered by large lists of free or open-source software implementations and publicly-available databases. " submitted by /u/bikeskata [link] [comments]  ( 1 min )
    [P] Gorse: An open-source recommender system service
    Recommender systems have been widely applied to many online services on the Internet. Although there have been lots of open source libraries, it is still hard to build a recommender system from scratch. I have maintained a project called Gorse for a long time. ​ Gorse aims to be a universal open-source recommender system that can be quickly introduced into a wide variety of online services. By importing items, users, and interaction data into Gorse, the system will automatically train models to generate recommendations for each user. ​ GitHub Repo: https://github.com/zhenghaoz/gorse https://preview.redd.it/sfjgsfmz55n81.png?width=2480&format=png&auto=webp&s=132ae61502fb9ee13e4f757078576285fc43e702 To demonstrate the usability of the Gorse recommender system engine, I have built a live demo called GitRec, which recommends repositories to GitHub users based on starred repositories. ​ Demo: https://gitrec.gorse.io/ Source: https://github.com/zhenghaoz/gitrec https://preview.redd.it/wk0zjhqu55n81.png?width=1044&format=png&auto=webp&s=acde1affca91bc98755953e34986b09acd9492c9 submitted by /u/zhenghaoz [link] [comments]  ( 1 min )
    [P] Looking to form a deep learning in medical imaging reading group
    Hello there. I'm not sure about the tag, but I wanted to post this here, since I think there are probably a lot of people who are working on medical imaging as of now. I'm planning to create a discord server about discussing machine learning in medical imaging, in which we talk about different papers, their implementations, and overall talk a lot about the possibilities and ideas. There are tons of great discussion about these kinds of issues on this sub, but I thought maybe a focused discussion server will prove beneficial too. Since I don't want to spam here and the server is not public, if anyone was interested, please dm me. submitted by /u/feryet [link] [comments]  ( 1 min )
    [D] Will Attention Based Architecture / Transformers Take Over Artificial Intelligence?
    A well popularized article in Quanta magazine ask the question « Will Transformers Take Over Artificial Intelligence? ». Since having revolutionized NLP, attention is conquering computer vision and reinforcement learning. I find pretty unfortunate that the attention mechanism was totally eclipsed by Transformers which is just a funny name (animation movie/ toy) for self-attention architecture, although the Google's paper title on Transformers was «Attention is all you need». submitted by /u/ClaudeCoulombe [link] [comments]  ( 1 min )
    [N] Karl Friston "Theory of Mind"
    submitted by /u/meldiwin [link] [comments]  ( 2 min )
    [D] How to stay up to date on ML trends as someone with low technical knowledge?
    I work in the computer science education space and find myself falling behind in my AI knowledge since leaving industry. I have written some ML applications in the past and took some classes for my degrees, but I find it's hard to keep up nowadays. Are there any good newsletters or accessible sites/blogs that can help me understand emerging trends and keep my knowledge up to date? I find a lot of technical articles and blogs overwhelming. Appreciate your time! submitted by /u/raw-shucked-oysters [link] [comments]  ( 1 min )
    AI research and Mathematics [D]
    (New to this so going to sound naive) I'm a senior undergrad student of Electrical Engineering. I have some level of interest in Machine Learning, my Final Year project being based upon it and all. But most of my interest was for the mathematics behind Machine Learning and AI. And most of the ML projects are just programming on keras and stuff. Like there can be maths involved here, just not the heavy kind like we learn in theory, so is there usually much research going on under AI making or refining mathematical algorithms for AI, instead of just using AI and ML for tasks, and if so what field. Also, I saw some researchers working on making more Biologically inspired Neural Networks, is it more in the AI domain, or more neuroscience domain, and if the latter, can a EE graduate enter this field? And is it maths extensive, like modelling neurons using Control theory and stuff? submitted by /u/a_khalid1999 [link] [comments]  ( 8 min )
    [P] A.I. Learns to Drive From Scratch in Trackmania - Great introduction and demonstration of reinforcement learning with Deep-Q-Learning
    submitted by /u/kris33 [link] [comments]
  • Open

    Frequency shift keying (FSK) spectrum
    This post will look encoding digital data as an analog signal using frequency shift keying (FSK), first directly and then with windowing. We’ll look at the spectrum of the encoded signal and show that basic FSK uses much less bandwidth than direct encoding, but more bandwidth than FSK with windowing. Square waves The most natural […] Frequency shift keying (FSK) spectrum first appeared on John D. Cook.  ( 3 min )
  • Open

    Artificial Nightmares : The Kraken's Fury || Clip Guided Disco Diffusion AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    Stanford Researchers Apply a Combination of Autonomous Drone Technology With Scientific Machine Learning To Find How Fast Will Antarctica’s Ice Sheet Melt and Reduce The Uncertainty of Sea-Level Rise
    Ice protects the Earth layer and its oceans by acting as a shield. Excess heat is reflected into space by these dazzling white spots, keeping the Earth cold. Many glaciers throughout the world have been melting quickly since the early 1900s. Human actions cause this phenomenon. Carbon dioxide (CO2) and other greenhouse gas emissions have elevated temperatures since the industrial revolution. Melting glaciers are a contributing factor in rising sea levels, which leads to an increase of coastal erosion and storm surge. Warmer air temperatures lead directly into more frequent storms like hurricanes or typhoons with stronger winds that cause even greater damage on land. Many cities are already planning to deal with long-term flooding, which may carry salt and moisture into houses and infrastructure, jeopardize drinking water and agriculture, and severely damaged ports. Given the gravity of the problem, it is critical to understand how much and how quickly sea levels will rise. The projections in the existing predictive models made by scientists are pretty uncertain. Since the contribution from the southernmost continent is so unknown, governments worldwide must consider an unlimited number of scenarios when planning for the future. A group of Stanford University scientists employed autonomous drone technology and machine learning approach to focus their efforts on discovering and gathering the most valuable data in Antarctica to increase our understanding of the processes that drive sea-level rise. Continue Reading Our Summary on This Research From Stanford or checkout the HAI Report submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Important Video Series about AGI - How to build it, and why is it dangerous
    Hey My Reddit Fellows, I just wanted to share a video series I am making about AGI, how to manage AGI Safety, and what the post singularity society will look like. Please subscribe to my channel, and let me know if you have any feedback and what topics you would like to see next! ►Playlist: https://youtube.com/playlist?list=PLb4nW1gtGNse4PA_T4FlgzU0otEfpB1q1 ►AGI Existential Threat: https://youtu.be/V4iQP7VDMvI ►Life 3.0: https://youtu.be/aWlSwZKzmzY ►Dangers of AGI Sub Goals: https://youtu.be/_-tQH03rq4g ►How to Create an AGI: https://youtu.be/7OHhqli9oaA Thank you! Bill submitted by /u/billgggggg [link] [comments]  ( 2 min )
    Artificial Intelligence Stocks: The 10 Best AI Companies
    submitted by /u/crazygrumpy [link] [comments]
  • Open

    Some questions about rollout algorithm
    As described in Sutton's RL book: " Rollout algorithms are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state. They estimate action values for a given policy by averaging the returns of many simulated trajectories that start with each possible action and then follow the given policy. When the action-value estimates are considered to be accurate enough, the action (or one of the actions) having the highest estimated value is executed, after which the process is carried out anew from the resulting next state." My questions 1. Since we are following a given policy and simulating begins with a given state. How come there will be different trajectories in a deterministic environment? 2. How do we do rollout in continuous space? submitted by /u/Blasphemer666 [link] [comments]  ( 1 min )
    OpenAI gym: Lunar Lander V2 Question
    Hi, I am trying to train an RL agent to solve the Lunar Lander V2 environment. However, for a simple DQN as well as a PPO controller I continue to see a situation that after some learning, the lander starts to just hover in a high position. Is this a common observation, or do I have to assume that my implementation is wrong? From the documentation, I see no negative reward for just hoovering... ​ Thanks for any enlightening comment, Dito submitted by /u/ditomax [link] [comments]  ( 1 min )
    OpenAI Gym - Variable action space in a custom enviroment, unknown upper bound
    Hello, I'm wanting to make a custom environment in openAI gym. My problem is the action space varies depending on the state, and I don't know if I can compute (without brute-forcing it across every state) the max. I saw one solution was to implement a negative reward if an impossible action was selected, but I'm not sure if that is practial for my use case. https://www.reddit.com/r/reinforcementlearning/comments/rlixoo/openai_gym_custom_environments_dynamically/ Is there a way to do this or should I just go for an action space larger than practical and hope it never exceeds that? submitted by /u/snaredrum_merchant [link] [comments]  ( 1 min )
    Skills To Learn Before Starting Reinforcement Learning(RL)
    Hey, I am new to RL. I was fascinated by the concept of reinforcement learning. But now when it comes to learning RL I would like to know skills required to to start RL and the road map to RL. I would also like to know if Dynamic Programming is an important skill to learn RL or not. Thank you. submitted by /u/Rishh3112 [link] [comments]  ( 2 min )
  • Open

    A.I. Learns to Drive From Scratch in Trackmania - Great introduction and demonstration of reinforcement learning with Deep-Q-Learning
    submitted by /u/kris33 [link] [comments]

  • Open

    [Discussion] Desperately need advice for binary SVC using Matlab
    It's leading up to an important deadline for me and I really need to get these results working by the end of this week. I am an extreme novice with ML, but due to issues with my master thesis I've had to start using it to try and salvage it. I have my sampled dataset, I have my Matlab script all set up, and it works for example datasets etc. But I can't seem to get a good result from my data set and I have exhausted a lot of options over the past couple weeks. It's sort of driving me insane because I can't see why it's not working. To summarise, the boundary of the safe and fail classes is very oscillatory (almost sinusoidal, but only peaks and veeery thin peaks at that, and the peaks themselves fluctuate in height). I imagined it should be possible to get a good result out of this dataset as the boundary is quite well defined. I've been using a Gaussian kernel, and when I sample only a small corner of the dataset (e.g imagine only one or two of the "peaks" are in this sample space), it can define the boundary quite well . But once I give the entire sample space to study, it just gets muddled and either gives me a linear looking boundary or just pure garbage... Its probably important to note the dataset is also imbalanced, but I have played around with weights and down-sampling, but it hasn't really helped. If anyone is an expert, particularly with Matlab, and willing to have a quick conversation with me about where I might be going wrong (kernel selection, data sampling, hyper parameter optimisation method, etc) I'd greatly appreciate it!! I would upload an image of my dataset, but want to avoid potential issues and whatnot when I submit my thesis. Unfortunately I will be going to sleep soon so will check for replies in the morning, but I appreciate any input immensely. submitted by /u/KitKatMMD [link] [comments]  ( 1 min )
    [R] A Neural Programming Language for the Reservoir Computer
    submitted by /u/shitboots [link] [comments]
    [P] Can anyone help me to understand how MoG works for background video subtraction?
    I'm just trying to understand it conceptually. It's background subtraction in a video using mixture of gaussians. So basically, we have to model each pixel in the training frames (video of a road) as a mixture of gaussians, later a car drives across and we have to just get that foreground image out of it. I don't understand what the "distribution of each pixel" means. So if i take the (50,50) pixel out of the first 100 frames of the video (it's gray scale), I could get a histogram out of those 100 gray scale values, is that what I'm trying to model? Like model the histogram as N gaussian distributions, like N different clumps of 0-255 values as gaussian distributions to represent that pixel? Or am i completely off. If anyone can help explain in simple terms, I'm not worried about the implementation itself more just what the gaussian distributions represents. Like if I was just doing K-means (hard assignments), what would each "cluster" be in this case? submitted by /u/jac-128514 [link] [comments]  ( 1 min )
    [D] Are memristors and/or analogue circuits helping to improve ML/AI?
    There was a big thing a few years ago on how Memristors were going to change computing and their impact in ML could be huge, are they having an impact and if so how big? It sounds like analogue circuits driven chips could be the next big wave in AI how do they compare to high end GPUs? submitted by /u/Arowx [link] [comments]  ( 1 min )
    [R] Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
    submitted by /u/Wiskkey [link] [comments]
    [D] Is it possible for us to make fixed-size multilayer perceptrons (MLP's) provably converge?
    I recently had a long conversation with Tim Scarfe and Keith Duggar on Machine Learning Street Talk (MLST) about theory related to neural networks. I really believe we can make better machine learning algorithms and better guarantees if we uncover the right theoretical track. I would really appreciate hearing from you all in the community about this work, so I've written up this post to accompany the MLST video. Enjoy! See the full interactive version of this post on my research page here. Get the code for these experiments here. Data Distributions and Initializing Neural Networks Is it possible for us to make fixed-size multilayer perceptrons (MLP's) provably converge? It's been bothering me that initialization seems arbitrary and all the optimization algorithms produce different resu…  ( 4 min )
    [P] Generating video from compressed/data lost input with lowest differance from the original
    Hello dear r/MachineLearning fellows, I have an idea which has a following requirement. With given video (mp4 for being simple, twitch clips of gamers for example) I want to generate a compressed (lossy or lossless), downscaled and/or data lost (corrupt) version (or various combinations of above methods) Using this low sized video and combination of upscaling/synthetic replacement (for lossy compression and corruption) I want to generate a video close to the original. The "closeness" is not visual but bit/byte closeness. So my aim is to generate a video with lowest bits of differance (diff of bytes) Most papers on this topic are about generating visually compelling outputs. So I would like to ask how to approach testing for which methods and combination percentages (like %20 downscale %10 compress) would be more beneficial to generate low diff (on byte level) with least data on input video. Also I have a small dataset of 100 videos and I am afraid of overfitting the result, so testing for generalization would be a requirement aswell. submitted by /u/oDeathwingo [link] [comments]  ( 1 min )
    [P] Time-series Benchmark methods that are Simple and Probabilistic
    Naive/Benchmark time series methods have many uses business applications. tablespoon makes generating these naive methods easy while taking advantage of Stan's efficient No U-Turn Sampler - much the same way Facebook Prophet it built on top of Stan. 😻 Github 😻 🥄 Project Site 🥄 ⭐Please star if you like the project.⭐ submitted by /u/Legitimate-Stay-5131 [link] [comments]  ( 1 min )
    [N] ImageNet: How a UK TV Cook ended up as 'slut' in an influential image database - Johannes Filter
    submitted by /u/filt_er [link] [comments]  ( 3 min )
    [P] Annotated implementation of the Retrieval-Enhanced Transformer
    https://nn.labml.ai/transformers/retro/model.html This is an annotated (side-by-side notes) implementation of RETRO in PyTorch. Retrieval Enhanced Transformer (RETRO) is 25X smaller than GPT-3 but has comparable performance. It uses chunks of similar text retrieved based on a frozen BERT model from a massive database (5 trillion tokens) to improve the performance of the model. Since the model can retrieve information from this large database it doesn't have to contain all the facts in the model weights. submitted by /u/mlvpj [link] [comments]  ( 1 min )
    [D] Open Discussion about Facial Expression/Motion Capture
    it would be so great and helpful to hear some opinions from someone as a professional in this field! The questions are as follows: What are the barriers of real-time facial expression tracking system based on 2D data? What are the advantages and disadvantages of facial expression capture based on 3D data (with depth camera or other techniques)? In the long term, what facial motion capture method/company do you think the most highly of and why? Which human body model (skeleton-based model, contour-based model, volume-based model) do you think will most probably become the mainstream model in the future? What do you think of the top-down approach and bottom-up approach to Human Pose Estimation? What method/approach do you think is the most suitable for facial expression capture specifically? Let's have an open discussion guys! Please feel free to share any of your thoughts or throw more questions on facial motion capture. Thanks a lot in advance for sharing your thoughts and time! Much appreciated! submitted by /u/pawz_up [link] [comments]  ( 1 min )
    [N] Google Colab Pro and Pro+ are newly available in these 10 countries: Ireland, Israel, Italy, Morocco, the Netherlands, Poland, Spain, Switzerland, Turkey, and the United Arab Emirates
    The news was announced here. submitted by /u/Wiskkey [link] [comments]  ( 1 min )
    [P] Web app "informative-drawings" on site Hugging Face quickly creates a line art drawing of an input image. Link in a comment.
    submitted by /u/Wiskkey [link] [comments]  ( 1 min )
    [D] What’re some topics from “pure math” that you’ve found use for in your work/research?
    By “pure math”, I’m referring to the particularly esoteric concepts that someone without a math degree or someone not doing research in a math-intensive field would likely never encounter. For example, topography, abstract algebra, field theory, differential geometry, number theory, set theory, ergodic theory, and perhaps even complex analysis. And to what degree did you have to use them? Was it to the point where you wished you had prior exposure to the subject in a classroom setting, or was it manageable and self-learnable with relative ease? submitted by /u/mowa0199 [link] [comments]  ( 2 min )
    [R][P] Investigating Tradeoffs in Real-World Video Super-Resolution + Hugging Face Gradio Web Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 2 min )
    [D] [P] Some ideas and speculations on the problem of efficiently processing length-irregular data
    Hi folks. I'm a new DL researcher who just began working at a startup that focuses on AI-based drug discovery. I'm afraid this is not the most suitable place to post this since this is more of an engineering idea, but I wanted to hear what you guys think about it and if you have any idea. I don't know how many of you have encountered the same efficiency issue before, but I've repeatedly come across this theme while implementing my research ideas: I have a dataset that consists of datapoints with non-uniform length along some dimensions (number of atoms in a molecule, number of amino acids in a protein etc.), and I want to perform numerical calculations (e.g. feed into a DL model) on a tensorized-batched form of them. The batching (turning them into a single tensor) would be an indispensi…  ( 4 min )
    [D] Where/when is EMNLP 2022
    I see other conferences like ACL, NAACL, and COLING in this list, but EMNLP is in TBA. I don't see when the submission deadline for that conference is at? Also I see EMNLP workshops being announced, but I still don't see when is the main conference. If anyone has any idea if EMNLP conference got cancelled or when is it happening, I would appreciate it. submitted by /u/SiegeMemeLord [link] [comments]  ( 1 min )
  • Open

    Is there a list with all AIs out there?
    submitted by /u/TheblackRook3 [link] [comments]
    Amazon Uses Machine Learning to Improve Video Quality on Prime Video
    Because streaming video might be harmed by flaws introduced during recording, encoding, packing, or transmission, most subscription video services, such as Amazon Prime Video, monitor the quality of the content they stream regularly. Manual content review, often known as eyes-on-glass testing, doesn’t scale well and comes with its own set of issues, such as discrepancies in reviewers’ quality judgments. The use of digital signal processing to detect anomalies in the video signal, which are typically associated with faults, is becoming more popular in the business. To validate new program releases or offline modifications to encoding profiles, Prime Video’s Video Quality Analysis (VQA) division began employing machine learning three years ago to discover faults in collected footage from devices such as consoles, TVs, and set-top boxes. More recently, Amazon has used the same techniques to solve problems like real-time quality monitoring of our thousands of channels and live events, as well as large-scale content analysis. The Amazon team at VQA trains computer vision models to watch a video and detect flaws like blocky frames, unexpected dark frames, and audio noise that could degrade the customer-watching experience. This allows Amazon to process video on a massive scale, allowing them to process hundreds of thousands of live events and catalog items. Continue Reading Our Summary on This Research https://i.redd.it/vfx8iiiopzm81.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    10 Best R Books for Data Science beginners & advanced to read in 2022 -
    submitted by /u/sivasiriyapureddy [link] [comments]
    Microsoft’s Latest Machine Learning Research Introduces μTransfer: A New Technique That Can Tune The 6.7 Billion Parameter GPT-3 Model Using Only 7% Of The Pretraining Compute
    Scientists conduct trial and error procedures which experimenting, that many times lear to freat scientific breakthroughs. Similarly, foundational research provides for developing large-scale AI systems theoretical insights that reduce the amount of trial and error required and can be very cost-effective. Microsoft team tunes massive neural networks that are too expensive to train several times. For this, they employed a specific parameterization that maintains appropriate hyperparameters across varied model sizes. The used µ-Parametrization (or µP, pronounced “myu-P”) is a unique way to learn all features in the infinite-width limit. The researchers collaborated with the OpenAI team to test the method’s practical benefit on various realistic cases. Studies have shown that training large neural networks because their behavior changes as they grow in size are uncertain. Many works suggest heuristics that attempt to maintain consistency in the activation scales at initialization. However, as training progresses, this uniformity breaks off at various model widths. CONTINUE READING MY SUMMARY ON THIS RESEARCH Paper: https://www.microsoft.com/en-us/research/uploads/prod/2021/11/TP5.pdf Github:https://github.com/microsoft/mup https://i.redd.it/gmn30ut8wvm81.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    This Month in Artificial Intelligence News
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Game changer for virtual avatars and social robots: this AI model can very accurately transfer your body motion from a video (speech video) to the virtual world! 🤩 🤯
    submitted by /u/MLtinkerer [link] [comments]  ( 1 min )
  • Open

    Training Multiple Robots for different tasks at the same time using Deep Reinforcement Learning
    Hi, I'm new to RL and just wondering if a single agent can train multiple robots to perform different tasks simultaneously. If possible, can you please recommend me some research papers and implementations that I can take a look at? I was hoping we could create two gym environments like env1 = gym.make("robot1-task1") and env2 = gym.make("robot2-task2") then use a single agent to sample experience from both environments at the same time and generalize for both tasks. submitted by /u/ncbdrck [link] [comments]  ( 1 min )
    Finding global optimum of an unknown high-dimensional objective function with Reinforcement Learning
    Hi everyone, I'm a newbie in Reinforcement Learning, but I have read a lot of scientific papers, but I'm really not sure if I am on the right track. My problem is, that I'd like to find the optimal parameters that maximize the output value. The output value is computed by a simulation tool - and the simulation is feeded by 200+ input parameters. So you can basically say, that the problem is to optimize an unknown objective function. ​ Because of that high-dimensionality, the way to go will be probably with policy-gradient RL, or even better actor-critic RL. As I do not have to explore large parts of the space - I only want to find the optimium - I'm guessing that an on-policy algorithm as for example PPO should be the way to go. ​ Am I right that using PPO could solve my problem or am I totally on the wrong track? Thank you very much in advance for your help, I would be really grateful! submitted by /u/wilkules [link] [comments]  ( 1 min )
    Parallelization with different robotics learning environments.--Isaac gym comparison.
    I have recently started using Isaac gym for simulating robots for deep reinforcement learning. The ability to simulate thousands of robots at once on a normal GPU feels like a total game changer. It seems strange to me that there are still only a handful of teams using it. It then occurred to me that perhaps other physics environments are actual also capable of similar performance if you know how to program them correctly. GPU physics has existed of a long time so it makes sense that there would be other simulators with excellent parallelizability. so my question is: how real is the Isaac gym speed up? Is it something that can be copied by other simulators? If its so much faster why aren't more people using it ? submitted by /u/sanddollarbeach [link] [comments]  ( 1 min )
    I need help about this, https://www.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1993_1/thrun_sebastian_1993_1.pdf
    submitted by /u/Professional_Card176 [link] [comments]
    Questions about the upper bound of overestimation in Q Learning
    I am forcing myself to understand the whole thing in the paper "Deep Reinforcement Learning with Double Q-learning" before code it out. In the paper "Issues in Using Function Approximation for Reinforcement Learning", I really can't understand the second line of the equation in the Appendix: Proofs section that try to derive the upper bound of the overestimation, what I can understand is it change the expectation to integral and combine a derivative of a function. As I cant understand the derivation of the upper bound, I am not trying to understand the derivation of the lower bound, or else I can't sleep at all, and also why there is no YouTube video explain about this? submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    An example of an RL problem
    I started learning RL about 6 months ago, and I learned the basic definitions of RL, such as agent, policy, reward, action value function, state-action value function, etc. However, I am still kind of feeling my knowledge is not enough, since I couldn’t find an example of an RL problem for which these concepts are defined and explained in details. For example, I know that the state-action value function specifies how good it is for an agent to perform a particular action in a state with a policy π, however, I don’t know how we could define this function for a simple RL example. Does anyone have a thorough example of an RL problem in which basic concepts of RL are defined and explained? submitted by /u/nimageran [link] [comments]  ( 2 min )
    How are rewards/scores calculated in openai Gym's Atari Skiing-v0?
    Hi guys! Was wondering if anyone knows how Atari gym skiing-v0 is rewarded? It seems like: small negative rewards (~-1 to -10) are given on every timestep. a huge negative reward (-1000 to -10000) is given at the termination of the game. But what is the criteria for the huge negative reward -- is it the number of times you bump into trees? the number of flags you missed? And what is the criterion for termination (timestep of the huge negative reward)? I can't access the raw code of the skiing-v0 (can't find it on github openai gym) so I have been wracking my brain trying to dissect this. Thanks in advance for the help! :) submitted by /u/sunchipsster [link] [comments]  ( 1 min )
    I need help about Double DQN
    I already understand the algorithm of DDQN, but I can't really understand the section "Overoptimism due to estimation errors", is there any math topic I need to learn to understand it? submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
  • Open

    DeepMind Ithaca - Restoring and attributing ancient texts using deep neural networks
    submitted by /u/binaryfor [link] [comments]
    hey guys a deep learning novice here, I can't really get my head around what could be the cause of the sharp drops in error around epoch 25 and epoch 60. if anyone could help it's much appreciated. thanks
    submitted by /u/ma7modbasha [link] [comments]  ( 2 min )
    Microsoft’s Latest Machine Learning Research Introduces μTransfer: A New Technique That Can Tune The 6.7 Billion Parameter GPT-3 Model Using Only 7% Of The Pretraining Compute
    Scientists conduct trial and error procedures which experimenting, that many times lear to freat scientific breakthroughs. Similarly, foundational research provides for developing large-scale AI systems theoretical insights that reduce the amount of trial and error required and can be very cost-effective. Microsoft team tunes massive neural networks that are too expensive to train several times. For this, they employed a specific parameterization that maintains appropriate hyperparameters across varied model sizes. The used µ-Parametrization (or µP, pronounced “myu-P”) is a unique way to learn all features in the infinite-width limit. The researchers collaborated with the OpenAI team to test the method’s practical benefit on various realistic cases. Studies have shown that training large neural networks because their behavior changes as they grow in size are uncertain. Many works suggest heuristics that attempt to maintain consistency in the activation scales at initialization. However, as training progresses, this uniformity breaks off at various model widths. CONTINUE READING MY SUMMARY ON THIS RESEARCH Paper: https://www.microsoft.com/en-us/research/uploads/prod/2021/11/TP5.pdf Github:https://github.com/microsoft/mup https://i.redd.it/wu93hpd7wvm81.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    When Good Data Goes Bad
    You’ve all heard the saying “Garbage in, garbage out” and probably have your own horror story about wrestling data to the ground over missing values, inconsistent codes, or variable formats.  It is hard to argue that you don’t need better data quality to make better decisions, but is better data quality enough? If the formats… Read More »When Good Data Goes Bad The post When Good Data Goes Bad appeared first on Data Science Central.  ( 6 min )
  • Open

    Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. (arXiv:2111.15640v3 [cs.CV] UPDATED)
    Diffusion probabilistic models (DPMs) have achieved remarkable quality in image generation that rivals GANs'. But unlike GANs, DPMs use a set of latent variables that lack semantic meaning and cannot serve as a useful representation for other tasks. This paper explores the possibility of using DPMs for representation learning and seeks to extract a meaningful and decodable representation of an input image via autoencoding. Our key idea is to use a learnable encoder for discovering the high-level semantics, and a DPM as the decoder for modeling the remaining stochastic variations. Our method can encode any image into a two-part latent code, where the first part is semantically meaningful and linear, and the second part captures stochastic details, allowing near-exact reconstruction. This capability enables challenging applications that currently foil GAN-based methods, such as attribute manipulation on real images. We also show that this two-level encoding improves denoising efficiency and naturally facilitates various downstream tasks including few-shot conditional sampling. Please visit our project page: https://Diff-AE.github.io/  ( 2 min )
    SoftSNN: Low-Cost Fault Tolerance for Spiking Neural Network Accelerators under Soft Errors. (arXiv:2203.05523v1 [cs.AR])
    Specialized hardware accelerators have been designed and employed to maximize the performance efficiency of Spiking Neural Networks (SNNs). However, such accelerators are vulnerable to transient faults (i.e., soft errors), which occur due to high-energy particle strikes, and manifest as bit flips at the hardware layer. These errors can change the weight values and neuron operations in the compute engine of SNN accelerators, thereby leading to incorrect outputs and accuracy degradation. However, the impact of soft errors in the compute engine and the respective mitigation techniques have not been thoroughly studied yet for SNNs. A potential solution is employing redundant executions (re-execution) for ensuring correct outputs, but it leads to huge latency and energy overheads. Toward this, we propose SoftSNN, a novel methodology to mitigate soft errors in the weight registers (synapses) and neurons of SNN accelerators without re-execution, thereby maintaining the accuracy with low latency and energy overheads. Our SoftSNN methodology employs the following key steps: (1) analyzing the SNN characteristics under soft errors to identify faulty weights and neuron operations, which are required for recognizing faulty SNN behavior; (2) a Bound-and-Protect technique that leverages this analysis to improve the SNN fault tolerance by bounding the weight values and protecting the neurons from faulty operations; and (3) devising lightweight hardware enhancements for the neural hardware accelerator to efficiently support the proposed technique. The experimental results show that, for a 900-neuron network with even a high fault rate, our SoftSNN maintains the accuracy degradation below 3%, while reducing latency and energy by up to 3x and 2.3x respectively, as compared to the re-execution technique.  ( 2 min )
    Access Control of Object Detection Models Using Encrypted Feature Maps. (arXiv:2202.00265v2 [cs.CV] UPDATED)
    In this paper, we propose an access control method for object detection models. The use of encrypted images or encrypted feature maps has been demonstrated to be effective in access control of models from unauthorized access. However, the effectiveness of the approach has been confirmed in only image classification models and semantic segmentation models, but not in object detection models. In this paper, the use of encrypted feature maps is shown to be effective in access control of object detection models for the first time.  ( 2 min )
    Deep Learning for Bayesian Optimization of Scientific Problems with High-Dimensional Structure. (arXiv:2104.11667v2 [cs.LG] UPDATED)
    Bayesian optimization (BO) is a popular paradigm for global optimization of expensive black-box functions, but there are many domains where the function is not completely black-box. The data may have some known structure (e.g. symmetries) and/or the data generation process can yield useful intermediate or auxiliary information in addition to the value of the optimization objective. However, surrogate models traditionally employed in BO, such as Gaussian Processes (GPs), scale poorly with dataset size and do not easily accommodate known structure or auxiliary information. Instead, we propose performing BO on complex, structured problems by using deep learning models with uncertainty, a class of scalable surrogate models that have the representation power and flexibility to handle structured data and exploit auxiliary information. We demonstrate BO on a number of realistic problems in physics and chemistry, including topology optimization of photonic crystal materials using convolutional neural networks, and chemical property optimization of molecules using graph neural networks. On these complex tasks, we show that neural networks often outperform GPs as surrogate models for BO in terms of both sampling efficiency and computational cost.  ( 2 min )
    TensorPlan and the Few Actions Lower Bound for Planning in MDPs under Linear Realizability of Optimal Value Functions. (arXiv:2110.02195v2 [cs.LG] UPDATED)
    We consider the minimax query complexity of online planning with a generative model in fixed-horizon Markov decision processes (MDPs) with linear function approximation. Following recent works, we consider broad classes of problems where either (i) the optimal value function $v^\star$ or (ii) the optimal action-value function $q^\star$ lie in the linear span of some features; or (iii) both $v^\star$ and $q^\star$ lie in the linear span when restricted to the states reachable from the starting state. Recently, Weisz et al. (2021b) showed that under (ii) the minimax query complexity of any planning algorithm is at least exponential in the horizon $H$ or in the feature dimension $d$ when the size $A$ of the action set can be chosen to be exponential in $\min(d,H)$. On the other hand, for the setting (i), Weisz et al. (2021a) introduced TensorPlan, a planner whose query cost is polynomial in all relevant quantities when the number of actions is fixed. Among other things, these two works left open the question whether polynomial query complexity is possible when $A$ is subexponential in $\min(d,H)$. In this paper we answer this question in the negative: we show that an exponentially large lower bound holds when $A=\Omega(\min(d^{1/4},H^{1/2}))$, under either (i), (ii) or (iii). In particular, this implies a perhaps surprising exponential separation of query complexity compared to the work of Du et al. (2021) who prove a polynomial upper bound when (iii) holds for all states. Furthermore, we show that the upper bound of TensorPlan can be extended to hold under (iii) and, for MDPs with deterministic transitions and stochastic rewards, also under (ii).  ( 3 min )
    Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework. (arXiv:2202.06767v2 [cs.CV] UPDATED)
    Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, containing 100 million Chinese image-text pairs from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate the VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques into VLP such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. Extensive experiments and a deep benchmarking of different downstream tasks are also provided. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, our model achieves an average accuracy of 73.03%. For the image-text retrieval task,our model achieves a mean recall of 71.6% on AIC-ICC which is 12.9% higher than the result of WenLan 2.0. More information can refer to https://wukong-dataset.github.io/wukong-dataset/.  ( 2 min )
    AD-GAN: End-to-end Unsupervised Nuclei Segmentation with Aligned Disentangling Training. (arXiv:2107.11022v2 [eess.IV] UPDATED)
    We consider unsupervised cell nuclei segmentation in this paper. Exploiting the recently-proposed unpaired image-to-image translation between cell nuclei images and randomly synthetic masks, existing approaches, e.g., CycleGAN, have achieved encouraging results. However, these methods usually take a two-stage pipeline and fail to learn end-to-end in cell nuclei images. More seriously, they could lead to the lossy transformation problem, i.e., the content inconsistency between the original images and the corresponding segmentation output. To address these limitations, we propose a novel end-to-end unsupervised framework called Aligned Disentangling Generative Adversarial Network (AD-GAN). Distinctively, AD-GAN introduces representation disentanglement to separate content representation (the underling spatial structure) from style representation (the rendering of the structure). With this framework, spatial structure can be preserved explicitly, enabling a significant reduction of macro-level lossy transformation. We also propose a novel training algorithm able to align the disentangled content in the latent space to reduce micro-level lossy transformation. Evaluations on real-world 2D and 3D datasets show that AD-GAN substantially outperforms the other comparison methods and the professional software both quantitatively and qualitatively. Specifically, the proposed AD-GAN leads to significant improvement over the current best unsupervised methods by an average 17.8% relatively (w.r.t. the metric DICE) on four cell nuclei datasets. As an unsupervised method, AD-GAN even performs competitive with the best supervised models, taking a further leap towards end-to-end unsupervised nuclei segmentation.  ( 2 min )
    Targeted Data Poisoning Attack on News Recommendation System by Content Perturbation. (arXiv:2203.03560v2 [cs.CR] UPDATED)
    News Recommendation System(NRS) has become a fundamental technology to many online news services. Meanwhile, several studies show that recommendation systems(RS) are vulnerable to data poisoning attacks, and the attackers have the ability to mislead the system to perform as their desires. A widely studied attack approach, injecting fake users, can be applied on the NRS when the NRS is treated the same as the other systems whose items are fixed. However, in the NRS, as each item (i.e. news) is more informative, we propose a novel approach to poison the NRS, which is to perturb contents of some browsed news that results in the manipulation of the rank of the target news. Intuitively, an attack is useless if it is highly likely to be caught, i.e., exposed. To address this, we introduce a notion of the exposure risk and propose a novel problem of attacking a history news dataset by means of perturbations where the goal is to maximize the manipulation of the target news rank while keeping the risk of exposure under a given budget. We design a reinforcement learning framework, called TDP-CP, which contains a two-stage hierarchical model to reduce the searching space. Meanwhile, influence estimation is also applied to save the time on retraining the NRS for rewards. We test the performance of TDP-CP under three NRSs and on different target news. Our experiments show that TDP-CP can increase the rank of the target news successfully with a limited exposure budget.  ( 2 min )
    Differentially Describing Groups of Graphs. (arXiv:2201.04064v2 [cs.SI] UPDATED)
    How does neural connectivity in autistic children differ from neural connectivity in healthy children or autistic youths? What patterns in global trade networks are shared across classes of goods, and how do these patterns change over time? Answering questions like these requires us to differentially describe groups of graphs: Given a set of graphs and a partition of these graphs into groups, discover what graphs in one group have in common, how they systematically differ from graphs in other groups, and how multiple groups of graphs are related. We refer to this task as graph group analysis, which seeks to describe similarities and differences between graph groups by means of statistically significant subgraphs. To perform graph group analysis, we introduce Gragra, which uses maximum entropy modeling to identify a non-redundant set of subgraphs with statistically significant associations to one or more graph groups. Through an extensive set of experiments on a wide range of synthetic and real-world graph groups, we confirm that Gragra works well in practice.  ( 2 min )
    Parsimonious Random Projection Neural Networks for the Numerical Solution of Initial-Value Problems of ODEs and index-1 DAEs. (arXiv:2203.05337v1 [math.NA])
    We address a physics-informed neural network based on the concept of random projections for the numerical solution of IVPs of nonlinear ODEs in linear-implicit form and index-1 DAEs, which may also arise from the spatial discretization of PDEs. The scheme has a single hidden layer with appropriately randomly parametrized Gaussian kernels and a linear output layer, while the internal weights are fixed to ones. The unknown weights between the hidden and output layer are computed by Newton's iterations, using the Moore-Penrose pseudoinverse for low to medium, and sparse QR decomposition with regularization for medium to large scale systems. To deal with stiffness and sharp gradients, we propose a variable step size scheme for adjusting the interval of integration and address a continuation method for providing good initial guesses for the Newton iterations. Based on previous works on random projections, we prove the approximation capability of the scheme for ODEs in the canonical form and index-1 DAEs in the semiexplicit form. The optimal bounds of the uniform distribution are parsimoniously chosen based on the bias-variance trade-off. The performance of the scheme is assessed through seven benchmark problems: four index-1 DAEs, the Robertson model, a model of five DAEs describing the motion of a bead, a model of six DAEs describing a power discharge control problem, the chemical Akzo Nobel problem and three stiff problems, the Belousov-Zhabotinsky, the Allen-Cahn PDE and the Kuramoto-Sivashinsky PDE. The efficiency of the scheme is compared with three solvers ode23t, ode23s, ode15s of the MATLAB ODE suite. Our results show that the proposed scheme outperforms the stiff solvers in several cases, especially in regimes where high stiffness or sharp gradients arise in terms of numerical accuracy, while the computational costs are for any practical purposes comparable.  ( 3 min )
    Skip Vectors for RDF Data: Extraction Based on the Complexity of Feature Patterns. (arXiv:2201.01996v3 [cs.LG] UPDATED)
    The Resource Description Framework (RDF) is a framework for describing metadata, such as attributes and relationships of resources on the Web. Machine learning tasks for RDF graphs adopt three methods: (i) support vector machines (SVMs) with RDF graph kernels, (ii) RDF graph embeddings, and (iii) relational graph convolutional networks. In this paper, we propose a novel feature vector (called a Skip vector) that represents some features of each resource in an RDF graph by extracting various combinations of neighboring edges and nodes. In order to make the Skip vector low-dimensional, we select important features for classification tasks based on the information gain ratio of each feature. The classification tasks can be performed by applying the low-dimensional Skip vector of each resource to conventional machine learning algorithms, such as SVMs, the k-nearest neighbors method, neural networks, random forests, and AdaBoost. In our evaluation experiments with RDF data, such as Wikidata, DBpedia, and YAGO, we compare our method with RDF graph kernels in an SVM. We also compare our method with the two approaches: RDF graph embeddings such as RDF2vec and relational graph convolutional networks on the AIFB, MUTAG, BGS, and AM benchmarks.  ( 2 min )
    The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence. (arXiv:2201.03954v2 [cs.LG] UPDATED)
    As the production of and reliance on datasets to produce automated decision-making systems (ADS) increases, so does the need for processes for evaluating and interrogating the underlying data. After launching the Dataset Nutrition Label in 2018, the Data Nutrition Project has made significant updates to the design and purpose of the Label, and is launching an updated Label in late 2020, which is previewed in this paper. The new Label includes context-specific Use Cases &Alerts presented through an updated design and user interface targeted towards the data scientist profile. This paper discusses the harm and bias from underlying training data that the Label is intended to mitigate, the current state of the work including new datasets being labeled, new and existing challenges, and further directions of the work, as well as Figures previewing the new label.  ( 2 min )
    MarS-FL: Enabling Competitors to Collaborate in Federated Learning. (arXiv:2110.13464v2 [cs.LG] UPDATED)
    Federated learning (FL) is rapidly gaining popularity and enables multiple data owners ({\em a.k.a.} FL participants) to collaboratively train machine learning models in a privacy-preserving way. A key unaddressed scenario is that these FL participants are in a competitive market, where market shares represent their competitiveness. Although they are interested to enhance the performance of their respective models through FL, market leaders (who are often data owners who can contribute significantly to building high performance FL models) want to avoid losing their market shares by enhancing their competitors' models. Currently, there is no modeling tool to analyze such scenarios and support informed decision-making. In this paper, we bridge this gap by proposing the \underline{mar}ket \underline{s}hare-based decision support framework for participation in \underline{FL} (MarS-FL). We introduce {\em two notions of $\delta$-stable market} and {\em friendliness} to measure the viability of FL and the market acceptability of FL. The FL participants' behaviours can then be predicted using game theoretic tools (i.e., their optimal strategies concerning participation in FL). If the market $\delta$-stability is achievable, the final model performance improvement of each FL-PT shall be bounded, which relates to the market conditions of FL applications. We provide tight bounds and quantify the friendliness, $\kappa$, of given market conditions to FL. Experimental results show the viability of FL in a wide range of market conditions. Our results are useful for identifying the market conditions under which collaborative FL model training is viable among competitors, and the requirements that have to be imposed while applying FL under these conditions.  ( 2 min )
    projUNN: efficient method for training deep networks with unitary matrices. (arXiv:2203.05483v1 [cs.LG])
    In learning with recurrent or very deep feed-forward networks, employing unitary matrices in each layer can be very effective at maintaining long-range stability. However, restricting network parameters to be unitary typically comes at the cost of expensive parameterizations or increased training runtime. We propose instead an efficient method based on rank-$k$ updates -- or their rank-$k$ approximation -- that maintains performance at a nearly optimal training runtime. We introduce two variants of this method, named Direct (projUNN-D) and Tangent (projUNN-T) projected Unitary Neural Networks, that can parameterize full $N$-dimensional unitary or orthogonal matrices with a training runtime scaling as $O(kN^2)$. Our method either projects low-rank gradients onto the closest unitary matrix (projUNN-T) or transports unitary matrices in the direction of the low-rank gradient (projUNN-D). Even in the fastest setting ($k=1$), projUNN is able to train a model's unitary parameters to reach comparable performances against baseline implementations. By integrating our projUNN algorithm into both recurrent and convolutional neural networks, our models can closely match or exceed benchmarked results from state-of-the-art algorithms.  ( 2 min )
    Distributed Deep Reinforcement Learning for Adaptive Medium Access and Modulation in Shared Spectrum. (arXiv:2109.11723v2 [eess.SP] UPDATED)
    Spectrum scarcity has led to growth in the use of unlicensed spectrum for cellular systems. This motivates intelligent adaptive approaches to spectrum access for both WiFi and 5G that improve upon traditional carrier sensing and listen-before-talk methods. We study decentralized contention-based medium access for base stations (BSs) of a single Radio Access Technology (RAT) operating on unlicensed shared spectrum. We devise a learning-based algorithm for both contention and adaptive modulation that attempts to maximize a network-wide downlink throughput objective. We formulate and develop novel distributed implementations of two deep reinforcement learning approaches - Deep Q Networks and Proximal Policy Optimization - modelled on a two stage Markov decision process. Empirically, we find the (proportional fairness) reward accumulated by the policy gradient approach to be significantly higher than even a genie-aided adaptive energy detection threshold. Our approaches are further validated by improved sum and peak throughput. The scalability of our approach to large networks is demonstrated via an improved cumulative reward earned on both indoor and outdoor layouts with a large number of BSs.  ( 2 min )
    To Impute or not to Impute? Missing Data in Treatment Effect Estimation. (arXiv:2202.02096v2 [stat.ML] UPDATED)
    Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population in distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.  ( 2 min )
    Outlier-based Autism Detection using Longitudinal Structural MRI. (arXiv:2202.09988v2 [eess.IV] UPDATED)
    Diagnosis of Autism Spectrum Disorder (ASD) using clinical evaluation (cognitive tests) is challenging due to wide variations amongst individuals. Since no effective treatment exists, prompt and reliable ASD diagnosis can enable the effective preparation of treatment regimens. This paper proposes structural Magnetic Resonance Imaging (sMRI)-based ASD diagnosis via an outlier detection approach. To learn Spatio-temporal patterns in structural brain connectivity, a Generative Adversarial Network (GAN) is trained exclusively with sMRI scans of healthy subjects. Given a stack of three adjacent slices as input, the GAN generator reconstructs the next three adjacent slices; the GAN discriminator then identifies ASD sMRI scan reconstructions as outliers. This model is compared against two other baselines -- a simpler UNet and a sophisticated Self-Attention GAN. Axial, Coronal, and Sagittal sMRI slices from the multi-site ABIDE II dataset are used for evaluation. Extensive experiments reveal that our ASD detection framework performs comparably with the state-of-the-art with far fewer training data. Furthermore, longitudinal data (two scans per subject over time) achieve 17-28% higher accuracy than cross-sectional data (one scan per subject). Among other findings, metrics employed for model training as well as reconstruction loss computation impact detection performance, and the coronal modality is found to best encode structural information for ASD detection.  ( 2 min )
    Path Integral Sampler: a stochastic control approach for sampling. (arXiv:2111.15141v2 [cs.LG] UPDATED)
    We present Path Integral Sampler~(PIS), a novel algorithm to draw samples from unnormalized probability density functions. The PIS is built on the Schr\"odinger bridge problem which aims to recover the most likely evolution of a diffusion process given its initial distribution and terminal distribution. The PIS draws samples from the initial distribution and then propagates the samples through the Schr\"odinger bridge to reach the terminal distribution. Applying the Girsanov theorem, with a simple prior diffusion, we formulate the PIS as a stochastic optimal control problem whose running cost is the control energy and terminal cost is chosen according to the target distribution. By modeling the control as a neural network, we establish a sampling algorithm that can be trained end-to-end. We provide theoretical justification of the sampling quality of PIS in terms of Wasserstein distance when sub-optimal control is used. Moreover, the path integrals theory is used to compute importance weights of the samples to compensate for the bias induced by the sub-optimality of the controller and time-discretization. We experimentally demonstrate the advantages of PIS compared with other start-of-the-art sampling methods on a variety of tasks.  ( 2 min )
    Anti-Money Laundering Alert Optimization Using Machine Learning with Graphs. (arXiv:2112.07508v2 [cs.LG] UPDATED)
    Money laundering is a global problem that concerns legitimizing proceeds from serious felonies (1.7-4 trillion euros annually) such as drug dealing, human trafficking, or corruption. The anti-money laundering systems deployed by financial institutions typically comprise rules aligned with regulatory frameworks. Human investigators review the alerts and report suspicious cases. Such systems suffer from high false-positive rates, undermining their effectiveness and resulting in high operational costs. We propose a machine learning triage model, which complements the rule-based system and learns to predict the risk of an alert accurately. Our model uses both entity-centric engineered features and attributes characterizing inter-entity relations in the form of graph-based features. We leverage time windows to construct the dynamic graph, optimizing for time and space efficiency. We validate our model on a real-world banking dataset and show how the triage model can reduce the number of false positives by 80% while detecting over 90% of true positives. In this way, our model can significantly improve anti-money laundering operations.  ( 2 min )
    A Deep Variational Approach to Clustering Survival Data. (arXiv:2106.05763v3 [cs.LG] UPDATED)
    In this work, we study the problem of clustering survival data $-$ a challenging and so far under-explored task. We introduce a novel semi-supervised probabilistic approach to cluster survival data by leveraging recent advances in stochastic gradient variational inference. In contrast to previous work, our proposed method employs a deep generative model to uncover the underlying distribution of both the explanatory variables and censored survival times. We compare our model to the related work on clustering and mixture models for survival data in comprehensive experiments on a wide range of synthetic, semi-synthetic, and real-world datasets, including medical imaging data. Our method performs better at identifying clusters and is competitive at predicting survival times. Relying on novel generative assumptions, the proposed model offers a holistic perspective on clustering survival data and holds a promise of discovering subpopulations whose survival is regulated by different generative mechanisms.  ( 2 min )
    Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective. (arXiv:2111.14820v2 [cs.LG] UPDATED)
    Learning behavioral patterns from observational data has been a de-facto approach to motion forecasting. Yet, the current paradigm suffers from two shortcomings: brittle under distribution shifts and inefficient for knowledge transfer. In this work, we propose to address these challenges from a causal representation perspective. We first introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables, namely invariant variables, style confounders, and spurious features. We then introduce a learning framework that treats each group separately: (i) unlike the common practice mixing datasets collected from different locations, we exploit their subtle distinctions by means of an invariance loss encouraging the model to suppress spurious correlations; (ii) we devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a sparse causal graph; (iii) we introduce a style contrastive loss that not only enforces the structure of style representations but also serves as a self-supervisory signal for test-time refinement on the fly. Experiments on synthetic and real datasets show that our proposed method improves the robustness and reusability of learned motion representations, significantly outperforming prior state-of-the-art motion forecasting models for out-of-distribution generalization and low-shot transfer.  ( 2 min )
    Fully Adaptive Composition in Differential Privacy. (arXiv:2203.05481v1 [cs.LG])
    Composition is a key feature of differential privacy. Well-known advanced composition theorems allow one to query a private database quadratically more times than basic privacy composition would permit. However, these results require that the privacy parameters of all algorithms be fixed before interacting with the data. To address this, Rogers et al. introduced fully adaptive composition, wherein both algorithms and their privacy parameters can be selected adaptively. The authors introduce two probabilistic objects to measure privacy in adaptive composition: privacy filters, which provide differential privacy guarantees for composed interactions, and privacy odometers, time-uniform bounds on privacy loss. There are substantial gaps between advanced composition and existing filters and odometers. First, existing filters place stronger assumptions on the algorithms being composed. Second, these odometers and filters suffer from large constants, making them impractical. We construct filters that match the tightness of advanced composition, including constants, despite allowing for adaptively chosen privacy parameters. We also construct several general families of odometers. These odometers can match the tightness of advanced composition at an arbitrary, preselected point in time, or at all points in time simultaneously, up to a doubly-logarithmic factor. We obtain our results by leveraging recent advances in time-uniform martingale concentration. In sum, we show that fully adaptive privacy is obtainable at almost no loss, and conjecture that our results are essentially unimprovable (even in constants) in general.  ( 2 min )
    Learning from Noisy Labels with Deep Neural Networks: A Survey. (arXiv:2007.08199v7 [cs.LG] UPDATED)
    Deep learning has achieved remarkable success in numerous domains with help from large amounts of big data. However, the quality of data labels is a concern because of the lack of high-quality labels in many real-world scenarios. As noisy labels severely degrade the generalization performance of deep neural networks, learning from noisy labels (robust training) is becoming an important task in modern deep learning applications. In this survey, we first describe the problem of learning with label noise from a supervised learning perspective. Next, we provide a comprehensive review of 62 state-of-the-art robust training methods, all of which are categorized into five groups according to their methodological difference, followed by a systematic comparison of six properties used to evaluate their superiority. Subsequently, we perform an in-depth analysis of noise rate estimation and summarize the typically used evaluation methodology, including public noisy datasets and evaluation metrics. Finally, we present several promising research directions that can serve as a guideline for future studies. All the contents will be available at https://github.com/songhwanjun/Awesome-Noisy-Labels.  ( 2 min )
    JUMBO: Scalable Multi-task Bayesian Optimization using Offline Data. (arXiv:2106.00942v2 [cs.LG] UPDATED)
    The goal of Multi-task Bayesian Optimization (MBO) is to minimize the number of queries required to accurately optimize a target black-box function, given access to offline evaluations of other auxiliary functions. When offline datasets are large, the scalability of prior approaches comes at the expense of expressivity and inference quality. We propose JUMBO, an MBO algorithm that sidesteps these limitations by querying additional data based on a combination of acquisition signals derived from training two Gaussian Processes (GP): a cold-GP operating directly in the input domain and a warm-GP that operates in the feature space of a deep neural network pretrained using the offline data. Such a decomposition can dynamically control the reliability of information derived from the online and offline data and the use of pretrained neural networks permits scalability to large offline datasets. Theoretically, we derive regret bounds for JUMBO and show that it achieves no-regret under conditions analogous to GP-UCB (Srinivas et. al. 2010). Empirically, we demonstrate significant performance improvements over existing approaches on two real-world optimization problems: hyper-parameter optimization and automated circuit design.  ( 2 min )
    A Contribution-based Device Selection Scheme in Federated Learning. (arXiv:2203.05369v1 [cs.LG])
    In a Federated Learning (FL) setup, a number of devices contribute to the training of a common model. We present a method for selecting the devices that provide updates in order to achieve improved generalization, fast convergence, and better device-level performance. We formulate a min-max optimization problem and decompose it into a primal-dual setup, where the duality gap is used to quantify the device-level performance. Our strategy combines \emph{exploration} of data freshness through a random device selection with \emph{exploitation} through simplified estimates of device contributions. This improves the performance of the trained model both in terms of generalization and personalization. A modified Truncated Monte-Carlo (TMC) method is applied during the exploitation phase to estimate the device's contribution and lower the communication overhead. The experimental results show that the proposed approach has a competitive performance, with lower communication overhead and competitive personalization performance against the baseline schemes.  ( 2 min )
    Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming. (arXiv:2109.11410v2 [cs.LG] UPDATED)
    A critical bottleneck in supervised machine learning is the need for large amounts of labeled data which is expensive and time consuming to obtain. However, it has been shown that a small amount of labeled data, while insufficient to re-train a model, can be effectively used to generate human-interpretable labeling functions (LFs). These LFs, in turn, have been used to generate a large amount of additional noisy labeled data, in a paradigm that is now commonly referred to as data programming. However, previous approaches to automatically generate LFs make no attempt to further use the given labeled data for model training, thus giving up opportunities for improved performance. Moreover, since the LFs are generated from a relatively small labeled dataset, they are prone to being noisy, and naively aggregating these LFs can lead to very poor performance in practice. In this work, we propose an LF based reweighting framework \ouralgo{} to solve these two critical limitations. Our algorithm learns a joint model on the (same) labeled dataset used for LF induction along with any unlabeled data in a semi-supervised manner, and more critically, reweighs each LF according to its goodness, influencing its contribution to the semi-supervised loss using a robust bi-level optimization algorithm. We show that our algorithm significantly outperforms prior approaches on several text classification datasets.  ( 2 min )
    Topologically-Informed Atlas Learning. (arXiv:2110.00429v2 [cs.RO] UPDATED)
    We present a new technique that enables manifold learning to accurately embed data manifolds that contain holes, without discarding any topological information. Manifold learning aims to embed high dimensional data into a lower dimensional Euclidean space by learning a coordinate chart, but it requires that the entire manifold can be embedded in a single chart. This is impossible for manifolds with holes. In such cases, it is necessary to learn an atlas: a collection of charts that collectively cover the entire manifold. We begin with many small charts, and combine them in a bottom-up approach, where charts are only combined if doing so will not introduce problematic topological features. When it is no longer possible to combine any charts, each chart is individually embedded with standard manifold learning techniques, completing the construction of the atlas. We show the efficacy of our method by constructing atlases for challenging synthetic manifolds; learning human motion embeddings from motion capture data; and learning kinematic models of articulated objects.  ( 2 min )
    On the finite representation of group equivariant operators via permutant measures. (arXiv:2008.06340v2 [math.GR] UPDATED)
    The study of $G$-equivariant operators is of great interest to explain and understand the architecture of neural networks. In this paper we show that each linear $G$-equivariant operator can be produced by a suitable permutant measure, provided that the group $G$ transitively acts on a finite signal domain $X$. This result makes available a new method to build linear $G$-equivariant operators in the finite setting.  ( 2 min )
    Deep Learning Based Automated COVID-19 Classification from Computed Tomography Images. (arXiv:2111.11191v2 [eess.IV] UPDATED)
    The paper presents a Convolutional Neural Networks (CNN) model for image classification, aiming at increasing predictive performance for COVID-19 diagnosis while avoiding deeper and thus more complex alternatives. The proposed model includes four similar convolutional layers followed by a flattening and two dense layers. This work proposes a less complex solution based on simply classifying 2D CT-Scan slices of images using their pixels via a 2D CNN model. Despite the simplicity in architecture, the proposed model showed improved quantitative results exceeding state-of-the-art on the same dataset of images, in terms of the macro f1 score. In this case study, extracting features from images, segmenting parts of the images, or other more complex techniques, ultimately aiming at images classification, do not yield better results. With that, this paper introduces a simple yet powerful deep learning based solution for automated COVID-19 classification.  ( 2 min )
    Context is Everything: Implicit Identification for Dynamics Adaptation. (arXiv:2203.05549v1 [cs.RO])
    Understanding environment dynamics is necessary for robots to act safely and optimally in the world. In realistic scenarios, dynamics are non-stationary and the causal variables such as environment parameters cannot necessarily be precisely measured or inferred, even during training. We propose Implicit Identification for Dynamics Adaptation (IIDA), a simple method to allow predictive models to adapt to changing environment dynamics. IIDA assumes no access to the true variations in the world and instead implicitly infers properties of the environment from a small amount of contextual data. We demonstrate IIDA's ability to perform well in unseen environments through a suite of simulated experiments on MuJoCo environments and a real robot dynamic sliding task. In general, IIDA significantly reduces model error and results in higher task performance over commonly used methods. Our code and robot videos are at https://bennevans.github.io/iida/  ( 2 min )
    Hierarchical Few-Shot Imitation with Skill Transition Models. (arXiv:2107.08981v2 [cs.LG] UPDATED)
    A desirable property of autonomous agents is the ability to both solve long-horizon problems and generalize to unseen tasks. Recent advances in data-driven skill learning have shown that extracting behavioral priors from offline data can enable agents to solve challenging long-horizon tasks with reinforcement learning. However, generalization to tasks unseen during behavioral prior training remains an outstanding challenge. To this end, we present Few-shot Imitation with Skill Transition Models (FIST), an algorithm that extracts skills from offline data and utilizes them to generalize to unseen tasks given a few downstream demonstrations. FIST learns an inverse skill dynamics model, a distance function, and utilizes a semi-parametric approach for imitation. We show that FIST is capable of generalizing to new tasks and substantially outperforms prior baselines in navigation experiments requiring traversing unseen parts of a large maze and 7-DoF robotic arm experiments requiring manipulating previously unseen objects in a kitchen.  ( 2 min )
    Conditional Prompt Learning for Vision-Language Models. (arXiv:2203.05557v1 [cs.CV])
    With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively-tuned manual prompts. In our study we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector). Compared to CoOp's static prompts, our dynamic prompts adapt to each instance and are thus less sensitive to class shift. Extensive experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset; and yields stronger domain generalization performance as well. Code is available at https://github.com/KaiyangZhou/CoOp.  ( 2 min )
    Filament Plots for Data Visualization. (arXiv:2107.10869v3 [cs.HC] UPDATED)
    The efficiency of modern computer graphics allows us to explore collections of space curves simultaneously with "drag-to-rotate" interfaces. This inspires us to replace "scatterplots of points" with "scatterplots of curves" to simultaneously visualize relationships across an entire dataset. Since spaces of curves are infinite dimensional, scatterplots of curves avoid the "lossy" nature of scatterplots of points. In particular, if two points are close in a scatterplot of points derived from high-dimensional data, it does not generally follow that the two associated data points are close in the data space. Standard Andrews plots provide scatterplots of curves that perfectly preserve Euclidean distances, but simultaneous visualization of these graphs over an entire dataset produces visual clutter because graphs of functions generally overlap in 2D. We mitigate this visual clutter issue by constructing computationally inexpensive 3D extensions of Andrews plots. First, we construct optimally smooth 3D Andrews plots by considering linear isometries from Euclidean data spaces to spaces of planar parametric curves. We rigorously parametrize the linear isometries that produce (on average) optimally smooth curves over a given dataset. This parameterization of optimal isometries reveals many degrees of freedom, and (using recent results on generalized Gauss sums) we identify a particular member of this set which admits an asymptotic "tour" property that avoids certain local degeneracies as well. Finally, we construct unit-length 3D curves (filaments) by numerically solving Frenet-Serret systems given data from these 3D Andrews plots. We conclude with examples of filament plots for several standard datasets, illustrating how filament plots avoid visual clutter. Code and examples available at https://github.com/n8epi/filaments/ and https://n8epi.github.io/filaments/  ( 2 min )
    Towards Less Constrained Macro-Neural Architecture Search. (arXiv:2203.05508v1 [cs.CV])
    Networks found with Neural Architecture Search (NAS) achieve state-of-the-art performance in a variety of tasks, out-performing human-designed networks. However, most NAS methods heavily rely on human-defined assumptions that constrain the search: architecture's outer-skeletons, number of layers, parameter heuristics and search spaces. Additionally, common search spaces consist of repeatable modules (cells) instead of fully exploring the architecture's search space by designing entire architectures (macro-search). Imposing such constraints requires deep human expertise and restricts the search to pre-defined settings. In this paper, we propose LCMNAS, a method that pushes NAS to less constrained search spaces by performing macro-search without relying on pre-defined heuristics or bounded search spaces. LCMNAS introduces three components for the NAS pipeline: i) a method that leverages information about well-known architectures to autonomously generate complex search spaces based on Weighted Directed Graphs with hidden properties, ii) a evolutionary search strategy that generates complete architectures from scratch, and iii) a mixed-performance estimation approach that combines information about architectures at initialization stage and lower fidelity estimates to infer their trainability and capacity to model complex functions. We present experiments showing that LCMNAS generates state-of-the-art architectures from scratch with minimal GPU computation. We study the importance of different NAS components on a macro-search setting. Code for reproducibility is public at \url{https://github.com/VascoLopes/LCMNAS}.  ( 2 min )
    Toward Compact Deep Neural Networks via Energy-Aware Pruning. (arXiv:2103.10858v2 [cs.CV] UPDATED)
    Despite the remarkable performance, modern deep neural networks are inevitably accompanied by a significant amount of computational cost for learning and deployment, which may be incompatible with their usage on edge devices. Recent efforts to reduce these overheads involve pruning and decomposing the parameters of various layers without performance deterioration. Inspired by several decomposition studies, in this paper, we propose a novel energy-aware pruning method that quantifies the importance of each filter in the network using nuclear-norm (NN). Proposed energy-aware pruning leads to state-of-the-art performance for Top-1 accuracy, FLOPs, and parameter reduction across a wide range of scenarios with multiple network architectures on CIFAR-10 and ImageNet after fine-grained classification tasks. On toy experiment, without fine-tuning, we can visually observe that NN has a minute change in decision boundaries across classes and outperforms the previous popular criteria. We achieve competitive results with 40.4/49.8% of FLOPs and 45.9/52.9% of parameter reduction with 94.13/94.61% in the Top-1 accuracy with ResNet-56/110 on CIFAR-10, respectively. In addition, our observations are consistent for a variety of different pruning setting in terms of data size as well as data quality which can be emphasized in the stability of the acceleration and compression with negligible accuracy loss.  ( 2 min )
    deepregression: a Flexible Neural Network Framework for Semi-Structured Deep Distributional Regression. (arXiv:2104.02705v3 [stat.ML] UPDATED)
    In this paper we describe the implementation of semi-structured deep distributional regression, a flexible framework to learn conditional distributions based on the combination of additive regression models and deep networks. Our implementation encompasses (1) a modular neural network building system based on the deep learning library \pkg{TensorFlow} for the fusion of various statistical and deep learning approaches, (2) an orthogonalization cell to allow for an interpretable combination of different subnetworks, as well as (3) pre-processing steps necessary to set up such models. The software package allows to define models in a user-friendly manner via a formula interface that is inspired by classical statistical model frameworks such as \pkg{mgcv}. The packages' modular design and functionality provides a unique resource for both scalable estimation of complex statistical models and the combination of approaches from deep learning and statistics. This allows for state-of-the-art predictive performance while simultaneously retaining the indispensable interpretability of classical statistical models.  ( 2 min )
    Sparse Multi-Reference Alignment : Phase Retrieval, Uniform Uncertainty Principles and the Beltway Problem. (arXiv:2106.12996v3 [math.ST] UPDATED)
    Motivated by cutting-edge applications like cryo-electron microscopy (cryo-EM), the Multi-Reference Alignment (MRA) model entails the learning of an unknown signal from repeated measurements of its images under the latent action of a group of isometries and additive noise of magnitude $\sigma$. Despite significant interest, a clear picture for understanding rates of estimation in this model has emerged only recently, particularly in the high-noise regime $\sigma \gg 1$ that is highly relevant in applications. Recent investigations have revealed a remarkable asymptotic sample complexity of order $\sigma^6$ for certain signals whose Fourier transforms have full support, in stark contrast to the traditional $\sigma^2$ that arise in regular models. Often prohibitively large in practice, these results have prompted the investigation of variations around the MRA model where better sample complexity may be achieved. In this paper, we show that sparse signals exhibit an intermediate $\sigma^4$ sample complexity even in the classical MRA model. Further, we characterise the dependence of the estimation rate on the support size $s$ as $O_p(1)$ and $O_p(s^{3.5})$ in the dilute and moderate regimes of sparsity respectively. Our techniques have implications for the problem of crystallographic phase retrieval, indicating a certain local uniqueness for the recovery of sparse signals from their power spectrum. Our results explore and exploit connections of the MRA estimation problem with two classical topics in applied mathematics: the beltway problem from combinatorial optimization, and uniform uncertainty principles from harmonic analysis. Our techniques include a certain enhanced form of the probabilistic method, which might be of general interest in its own right.  ( 2 min )
    Lower Bounds on the Total Variation Distance Between Mixtures of Two Gaussians. (arXiv:2109.01064v2 [math.PR] UPDATED)
    Mixtures of high dimensional Gaussian distributions have been studied extensively in statistics and learning theory. While the total variation distance appears naturally in the sample complexity of distribution learning, it is analytically difficult to obtain tight lower bounds for mixtures. Exploiting a connection between total variation distance and the characteristic function of the mixture, we provide fairly tight functional approximations. This enables us to derive new lower bounds on the total variation distance between pairs of two-component Gaussian mixtures that have a shared covariance matrix.  ( 2 min )
    SLGPT: Using Transfer Learning to Directly Generate Simulink Model Files and Find Bugs in the Simulink Toolchain. (arXiv:2105.07465v3 [cs.SE] UPDATED)
    Finding bugs in a commercial cyber-physical system (CPS) development tool such as Simulink is hard as its codebase contains millions of lines of code and complete formal language specifications are not available. While deep learning techniques promise to learn such language specifications from sample models, deep learning needs a large number of training data to work well. SLGPT addresses this problem by using transfer learning to leverage the powerful Generative Pre-trained Transformer 2 (GPT-2) model, which has been pre-trained on a large set of training data. SLGPT adapts GPT-2 to Simulink with both randomly generated models and models mined from open-source repositories. SLGPT produced Simulink models that are both more similar to open-source models than its closest competitor, DeepFuzzSL, and found a super-set of the Simulink development toolchain bugs found by DeepFuzzSL.  ( 2 min )
    Computationally Efficient Learning of Statistical Manifolds. (arXiv:2103.11773v2 [cs.LG] UPDATED)
    Analyzing high-dimensional data with manifold learning algorithms often requires searching for the nearest neighbors of all observations. This presents a computational bottleneck in statistical manifold learning when observations of probability distributions rather than vector-valued variables are available or when data size is large. We resolve this problem by proposing a new method for approximation in statistical manifold learning. The novelty of our approximation is the strongly consistent distance estimators based on independent and identically distributed samples from probability distributions. By exploiting the connection between Hellinger/total variation distance for discrete distributions and the L2/L1 norm, we demonstrate that the proposed distance estimators, combined with approximate nearest neighbor searching, could largely improve the computational efficiency with little to no loss in the accuracy of manifold embedding. The result is robust to different manifold learning algorithms and different approximate nearest neighbor algorithms. The proposed method is applied to learning statistical manifolds of electricity usage. This application demonstrates how underlying structures in high dimensional data, including anomalies, can be visualized and identified, in a way that is scalable to large datasets.  ( 2 min )
    Near-optimal Deep Reinforcement Learning Policies from Data for Zone Temperature Control. (arXiv:2203.05434v1 [cs.LG])
    Replacing poorly performing existing controllers with smarter solutions will decrease the energy intensity of the building sector. Recently, controllers based on Deep Reinforcement Learning (DRL) have been shown to be more effective than conventional baselines. However, since the optimal solution is usually unknown, it is still unclear if DRL agents are attaining near-optimal performance in general or if there is still a large gap to bridge. In this paper, we investigate the performance of DRL agents compared to the theoretically optimal solution. To that end, we leverage Physically Consistent Neural Networks (PCNNs) as simulation environments, for which optimal control inputs are easy to compute. Furthermore, PCNNs solely rely on data to be trained, avoiding the difficult physics-based modeling phase, while retaining physical consistency. Our results hint that DRL agents not only clearly outperform conventional rule-based controllers, they furthermore attain near-optimal performance.  ( 2 min )
    The Nonconvex Geometry of Linear Inverse Problems. (arXiv:2101.02776v2 [math.OC] UPDATED)
    The gauge function, closely related to the atomic norm, measures the complexity of a statistical model, and has found broad applications in machine learning and statistical signal processing. In a high-dimensional learning problem, the gauge function attempts to safeguard against overfitting by promoting a sparse (concise) representation within the learning alphabet. In this work, within the context of linear inverse problems, we pinpoint the source of its success, but also argue that the applicability of the gauge function is inherently limited by its convexity, and showcase several learning problems where the classical gauge function theory fails. We then introduce a new notion of statistical complexity, gauge$_p$ function, which overcomes the limitations of the gauge function. The gauge$_p$ function is a simple generalization of the gauge function that can tightly control the sparsity of a statistical model within the learning alphabet and, perhaps surprisingly, draws further inspiration from the Burer-Monteiro factorization in computational mathematics. We also propose a new learning machine, with the building block of gauge$_p$ function, and arm this machine with a number of statistical guarantees. The potential of the proposed gauge$_p$ function theory is then studied for two stylized applications. Finally, we discuss the computational aspects and, in particular, suggest a tractable numerical algorithm for implementing the new learning machine.  ( 2 min )
    Prediction-Guided Distillation for Dense Object Detection. (arXiv:2203.05469v1 [cs.CV])
    Real-world object detection models should be cheap and accurate. Knowledge distillation (KD) can boost the accuracy of a small, cheap detection model by leveraging useful information from a larger teacher model. However, a key challenge is identifying the most informative features produced by the teacher for distillation. In this work, we show that only a very small fraction of features within a ground-truth bounding box are responsible for a teacher's high detection performance. Based on this, we propose Prediction-Guided Distillation (PGD), which focuses distillation on these key predictive regions of the teacher and yields considerable gains in performance over many existing KD baselines. In addition, we propose an adaptive weighting scheme over the key regions to smooth out their influence and achieve even better performance. Our proposed approach outperforms current state-of-the-art KD baselines on a variety of advanced one-stage detection architectures. Specifically, on the COCO dataset, our method achieves between +3.1% and +4.6% AP improvement using ResNet-101 and ResNet-50 as the teacher and student backbones, respectively. On the CrowdHuman dataset, we achieve +3.2% and +2.0% improvements in MR and AP, also using these backbones. Our code is available at https://github.com/ChenhongyiYang/PGD.  ( 2 min )
    Forecasting the abnormal events at well drilling with machine learning. (arXiv:2203.05378v1 [cs.LG])
    We present a data-driven and physics-informed algorithm for drilling accident forecasting. The core machine-learning algorithm uses the data from the drilling telemetry representing the time-series. We have developed a Bag-of-features representation of the time series that enables the algorithm to predict the probabilities of six types of drilling accidents in real-time. The machine-learning model is trained on the 125 past drilling accidents from 100 different Russian oil and gas wells. Validation shows that the model can forecast 70% of drilling accidents with a false positive rate equals to 40%. The model addresses partial prevention of the drilling accidents at the well construction.  ( 2 min )
    Supervised ML Solution for Band Assignment in Dual-Band Systems with Omnidirectional and Directional Antennas. (arXiv:1902.10890v2 [cs.LG] UPDATED)
    Many wireless networks, including 5G NR (New Radio) and future beyond 5G cellular systems, are expected to operate on multiple frequency bands. This paper considers the band assignment (BA) problem in dual-band systems, where the basestation (BS) chooses one of the two available frequency bands (centimeter-wave and millimeter-wave bands) to communicate with the user equipment (UE). While the millimeter-wave band might offer higher data rate, there is a significant probability of outage during which the communication should be carried on the (more reliable) centimeter-wave band. With mobility, the BA can be perceived as a sequential problem, where the BS uses previously observed information to predict the best band for a future time step. We formulate the BA as a binary classification problem and propose supervised Machine Learning (ML) solutions. We study the problem when both the BS and the UE use (i) omnidirectional antennas and (ii) both use directional antennas. In the omnidirectional case, we derive analytical benchmark solutions based on the Gaussian Process (GP) assumption for the inter-band shadow fading. In the directional case, where the labeling is shown to be complex, we propose an efficient labeling approach based on the Viterbi Algorithm (VA). We compare the performances for two channel models: (i) a stochastic channel and (ii) a ray-tracing based channel.  ( 2 min )
    A Locally Adaptive Interpretable Regression. (arXiv:2005.03350v3 [stat.ML] UPDATED)
    Machine learning models with both good predictability and high interpretability are crucial for decision support systems. Linear regression is one of the most interpretable prediction models. However, the linearity in a simple linear regression worsens its predictability. In this work, we introduce a locally adaptive interpretable regression (LoAIR). In LoAIR, a metamodel parameterized by neural networks predicts percentile of a Gaussian distribution for the regression coefficients for a rapid adaptation. Our experimental results on public benchmark datasets show that our model not only achieves comparable or better predictive performance than the other state-of-the-art baselines but also discovers some interesting relationships between input and target variables such as a parabolic relationship between CO2 emissions and Gross National Product (GNP). Therefore, LoAIR is a step towards bridging the gap between econometrics, statistics, and machine learning by improving the predictive ability of linear regression without depreciating its interpretability.  ( 2 min )
    A General Pairwise Comparison Model for Extremely Sparse Networks. (arXiv:2002.08853v3 [stat.ML] UPDATED)
    Statistical inference using pairwise comparison data is an effective approach to analyzing large-scale sparse networks. In this paper, we propose a general framework to model the mutual interactions in a network, which enjoys ample flexibility in terms of model parametrization. Under this setup, we show that the maximum likelihood estimator for the latent score vector of the subjects is uniformly consistent under a near-minimal condition on network sparsity. This condition is sharp in terms of the leading order asymptotics describing the sparsity. Our analysis utilizes a novel chaining technique and illustrates an important connection between graph topology and model consistency. Our results guarantee that the maximum likelihood estimator is justified for estimation in large-scale pairwise comparison networks where data are asymptotically deficient. Simulation studies are provided in support of our theoretical findings.  ( 2 min )
    On Embeddings for Numerical Features in Tabular Deep Learning. (arXiv:2203.05556v1 [cs.LG])
    Recently, Transformer-like deep architectures have shown strong performance on tabular data problems. Unlike traditional models, e.g., MLP, these architectures map scalar values of numerical features to high-dimensional embeddings before mixing them in the main backbone. In this work, we argue that embeddings for numerical features are an underexplored degree of freedom in tabular DL, which allows constructing more powerful DL models and competing with GBDT on some traditionally GBDT-friendly benchmarks. We start by describing two conceptually different approaches to building embedding modules: the first one is based on a piecewise linear encoding of scalar values, and the second one utilizes periodic activations. Then, we empirically demonstrate that these two approaches can lead to significant performance boosts compared to the embeddings based on conventional blocks such as linear layers and ReLU activations. Importantly, we also show that embedding numerical features is beneficial for many backbones, not only for Transformers. Specifically, after proper embeddings, simple MLP-like models can perform on par with the attention-based architectures. Overall, we highlight that embeddings for numerical features are an important design aspect, which has good potential for further improvements in tabular DL.  ( 2 min )
    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. (arXiv:2203.05482v1 [cs.LG])
    The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. As a highlight, the resulting ViT-G model attains 90.94% top-1 accuracy on ImageNet, a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically.  ( 2 min )
    Geometric and Topological Inference for Deep Representations of Complex Networks. (arXiv:2203.05488v1 [cs.LG])
    Understanding the deep representations of complex networks is an important step of building interpretable and trustworthy machine learning applications in the age of internet. Global surrogate models that approximate the predictions of a black box model (e.g. an artificial or biological neural net) are usually used to provide valuable theoretical insights for the model interpretability. In order to evaluate how well a surrogate model can account for the representation in another model, we need to develop inference methods for model comparison. Previous studies have compared models and brains in terms of their representational geometries (characterized by the matrix of distances between representations of the input patterns in a model layer or cortical area). In this study, we propose to explore these summary statistical descriptions of representations in models and brains as part of a broader class of statistics that emphasize the topology as well as the geometry of representations. The topological summary statistics build on topological data analysis (TDA) and other graph-based methods. We evaluate these statistics in terms of the sensitivity and specificity that they afford when used for model selection, with the goal to relate different neural network models to each other and to make inferences about the computational mechanism that might best account for a black box representation. These new methods enable brain and computer scientists to visualize the dynamic representational transformations learned by brains and models, and to perform model-comparative statistical inference.  ( 2 min )
    LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval. (arXiv:2203.05465v1 [cs.CV])
    Dual encoders and cross encoders have been widely used for image-text retrieval. Between the two, the dual encoder encodes the image and text independently followed by a dot product, while the cross encoder jointly feeds image and text as the input and performs dense multi-modal fusion. These two architectures are typically modeled separately without interaction. In this work, we propose LoopITR, which combines them in the same network for joint learning. Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder. Both steps are efficiently performed together in the same model. Our work centers on empirical analyses of this combined architecture, putting the main focus on the design of the distillation objective. Our experimental results highlight the benefits of training the two encoders in the same network, and demonstrate that distillation can be quite effective with just a few hard negative examples. Experiments on two standard datasets (Flickr30K and COCO) show our approach achieves state-of-the-art dual encoder performance when compared with approaches using a similar amount of data.  ( 2 min )
    An Empirical Study of Low Precision Quantization for TinyML. (arXiv:2203.05492v1 [cs.LG])
    Tiny machine learning (tinyML) has emerged during the past few years aiming to deploy machine learning models to embedded AI processors with highly constrained memory and computation capacity. Low precision quantization is an important model compression technique that can greatly reduce both memory consumption and computation cost of model inference. In this study, we focus on post-training quantization (PTQ) algorithms that quantize a model to low-bit (less than 8-bit) precision with only a small set of calibration data and benchmark them on different tinyML use cases. To achieve a fair comparison, we build a simulated quantization framework to investigate recent PTQ algorithms. Furthermore, we break down those algorithms into essential components and re-assembled a generic PTQ pipeline. With ablation study on different alternatives of components in the pipeline, we reveal key design choices when performing low precision quantization. We hope this work could provide useful data points and shed lights on the future research of low precision quantization.  ( 2 min )
    Bias-variance decomposition of overparameterized regression with random linear features. (arXiv:2203.05443v1 [stat.ML])
    In classical statistics, the bias-variance trade-off describes how varying a model's complexity (e.g., number of fit parameters) affects its ability to make accurate predictions. According to this trade-off, optimal performance is achieved when a model is expressive enough to capture trends in the data, yet not so complex that it overfits idiosyncratic features of the training data. Recently, it has become clear that this classic understanding of the bias-variance must be fundamentally revisited in light of the incredible predictive performance of "overparameterized models" -- models that avoid overfitting even when the number of fit parameters is large enough to perfectly fit the training data. Here, we present results for one of the simplest examples of an overparameterized model: regression with random linear features (i.e. a two-layer neural network with a linear activation function). Using the zero-temperature cavity method, we derive analytic expressions for the training error, test error, bias, and variance. We show that the linear random features model exhibits three phase transitions: two different transitions to an interpolation regime where the training error is zero, along with an additional transition between regimes with large bias and minimal bias. Using random matrix theory, we show how each transition arises due to small nonzero eigenvalues in the Hessian matrix. Finally, we compare and contrast the phase diagram of the random linear features model to the random nonlinear features model and ordinary regression, highlighting the new phase transitions that result from the use of linear basis functions.  ( 2 min )
    Differentially Private Learning Needs Hidden State (Or Much Faster Convergence). (arXiv:2203.05363v1 [stat.ML])
    Differential privacy analysis of randomized learning algorithms typically relies on composition theorems, where the implicit assumption is that the internal state of the iterative algorithm is revealed to the adversary. However, by assuming hidden states for DP algorithms (when only the last-iterate is observable), recent works prove a converging privacy bound for noisy gradient descent (on strongly convex smooth loss function) that is significantly smaller than composition bounds after $O(1/\text{step-size})$ epochs. In this paper, we extend this hidden-state analysis to the noisy mini-batch stochastic gradient descent algorithms on strongly-convex smooth loss functions. We prove converging R\'enyi DP bounds under various mini-batch sampling schemes, such as "shuffle and partition" (which are used in practical implementations of DP-SGD) and "sampling without replacement". We prove that, in these settings, our privacy bound is much smaller than the composition bound for training with a large number of iterations (which is the case for learning from high-dimensional data). Our converging privacy analysis, thus, shows that differentially private learning, with a tight bound, needs hidden state privacy analysis or a fast convergence. To complement our theoretical results, we run experiment on training classification models on MNIST, FMNIST and CIFAR-10 datasets, and observe a better accuracy given fixed privacy budgets, under the hidden-state analysis.  ( 2 min )
    API: Boosting Multi-Agent Reinforcement Learning via Agent-Permutation-Invariant Networks. (arXiv:2203.05285v1 [cs.LG])
    Multi-agent reinforcement learning suffers from poor sample efficiency due to the exponential growth of the state-action space. Considering a homogeneous multiagent system, a global state consisting of $m$ homogeneous components has $m!$ differently ordered representations, thus designing functions satisfying permutation invariant (PI) can reduce the state space by a factor of $\frac{1}{m!}$. However, mainstream MARL algorithms ignore this property and learn over the original state space. To achieve PI, previous works including data augmentation based methods and embedding-sharing architecture based methods, suffer from training instability and limited model capacity. In this work, we propose two novel designs to achieve PI, while avoiding the above limitations. The first design permutes the same but differently ordered inputs back to the same order and the downstream networks only need to learn function mapping over fixed-ordering inputs instead of all permutations, which is much easier to train. The second design applies a hypernetwork to generate customized embedding for each component, which has higher representational capacity than the previous embedding-sharing method. Empirical results on the SMAC benchmark show that the proposed method achieves 100% win-rates in almost all hard and super-hard scenarios (never achieved before), and superior sample-efficiency than the state-of-the-art baselines by up to 400%.  ( 2 min )
    Blind Extraction of Equitable Partitions from Graph Signals. (arXiv:2203.05407v1 [math.OC])
    Finding equitable partitions is closely related to the extraction of graph symmetries and of interest in a variety of applications context such as node role detection, cluster synchronization, consensus dynamics, and network control problems. In this work we study a blind identification problem in which we aim to recover an equitable partition of a network without the knowledge of the network's edges but based solely on the observations of the outputs of an unknown graph filter. Specifically, we consider two settings. First, we consider a scenario in which we can control the input to the graph filter and present a method to extract the partition inspired by the well known Weisfeiler-Lehman (color refinement) algorithm. Second, we generalize this idea to a setting where only observe the outputs to random, low-rank excitations of the graph filter, and present a simple spectral algorithm to extract the relevant equitable partitions. Finally, we establish theoretical bounds on the error that this spectral detection scheme incurs and perform numerical experiments that illustrate our theoretical results and compare both algorithms.  ( 2 min )
    AIFB-WebScience at SemEval-2022 Task 12: Relation Extraction First -- Using Relation Extraction to Identify Entities. (arXiv:2203.05325v1 [cs.CL])
    In this paper, we present an end-to-end joint entity and relation extraction approach based on transformer-based language models. We apply the model to the task of linking mathematical symbols to their descriptions in LaTeX documents. In contrast to existing approaches, which perform entity and relation extraction in sequence, our system incorporates information from relation extraction into entity extraction. This means that the system can be trained even on data sets where only a subset of all valid entity spans is annotated. We provide an extensive evaluation of the proposed system and its strengths and weaknesses. Our approach, which can be scaled dynamically in computational complexity at inference time, produces predictions with high precision and reaches 3rd place in the leaderboard of SemEval-2022 Task 12. For inputs in the domain of physics and math, it achieves high relation extraction macro f1 scores of 95.43% and 79.17%, respectively. The code used for training and evaluating our models is available at: https://github.com/nicpopovic/RE1st  ( 2 min )
    Exploiting the Potential of Datasets: A Data-Centric Approach for Model Robustness. (arXiv:2203.05323v1 [cs.LG])
    Robustness of deep neural networks (DNNs) to malicious perturbations is a hot topic in trustworthy AI. Existing techniques obtain robust models given fixed datasets, either by modifying model structures, or by optimizing the process of inference or training. While significant improvements have been made, the possibility of constructing a high-quality dataset for model robustness remain unexplored. Follow the campaign of data-centric AI launched by Andrew Ng, we propose a novel algorithm for dataset enhancement that works well for many existing DNN models to improve robustness. Transferable adversarial examples and 14 kinds of common corruptions are included in our optimized dataset. In the data-centric robust learning competition hosted by Alibaba Group and Tsinghua University, our algorithm came third out of more than 3000 competitors in the first stage while we ranked fourth in the second stage. Our code is available at \url{https://github.com/hncszyq/tianchi_challenge}.  ( 2 min )
    Deep Regression Ensembles. (arXiv:2203.05417v1 [stat.ML])
    We introduce a methodology for designing and training deep neural networks (DNN) that we call "Deep Regression Ensembles" (DRE). It bridges the gap between DNN and two-layer neural networks trained with random feature regression. Each layer of DRE has two components, randomly drawn input weights and output weights trained myopically (as if the final output layer) using linear ridge regression. Within a layer, each neuron uses a different subset of inputs and a different ridge penalty, constituting an ensemble of random feature ridge regressions. Our experiments show that a single DRE architecture is at par with or exceeds state-of-the-art DNN in many data sets. Yet, because DRE neural weights are either known in closed-form or randomly drawn, its computational cost is orders of magnitude smaller than DNN.  ( 2 min )
    Back to Reality: Weakly-supervised 3D Object Detection with Shape-guided Label Enhancement. (arXiv:2203.05238v1 [cs.CV])
    In this paper, we propose a weakly-supervised approach for 3D object detection, which makes it possible to train strong 3D detector with position-level annotations (i.e. annotations of object centers). In order to remedy the information loss from box annotations to centers, our method, namely Back to Reality (BR), makes use of synthetic 3D shapes to convert the weak labels into fully-annotated virtual scenes as stronger supervision, and in turn utilizes the perfect virtual labels to complement and refine the real labels. Specifically, we first assemble 3D shapes into physically reasonable virtual scenes according to the coarse scene layout extracted from position-level annotations. Then we go back to reality by applying a virtual-to-real domain adaptation method, which refine the weak labels and additionally supervise the training of detector with the virtual scenes. Furthermore, we propose a more challenging benckmark for indoor 3D object detection with more diversity in object sizes to better show the potential of BR. With less than 5% of the labeling labor, we achieve comparable detection performance with some popular fully-supervised approaches on the widely used ScanNet dataset. Code is available at: https://github.com/xuxw98/BackToReality  ( 2 min )
    CoCo-FL: Communication- and Computation-Aware Federated Learning via Partial NN Freezing and Quantization. (arXiv:2203.05468v1 [cs.LG])
    Devices participating in federated learning (FL) typically have heterogeneous communication and computation resources. However, all devices need to finish training by the same deadline dictated by the server when applying synchronous FL, as we consider in this paper. Reducing the complexity of the trained neural network (NN) at constrained devices, i.e., by dropping neurons/filters, is insufficient as it tightly couples reductions in communication and computation requirements, wasting resources. Quantization has proven effective to accelerate inference, but quantized training suffers from accuracy losses. We present a novel mechanism that quantizes during training parts of the NN to reduce the computation requirements, freezes them to reduce the communication and computation requirements, and trains the remaining parts in full precision to maintain a high convergence speed and final accuracy. Using this mechanism, we present the first FL technique that independently optimizes for specific communication and computation constraints in FL: CoCo-FL. We show that CoCo-FL reaches a much higher convergence speed than the state of the art and a significantly higher final accuracy.  ( 2 min )
    Robustness Analysis of Classification Using Recurrent Neural Networks with Perturbed Sequential Input. (arXiv:2203.05403v1 [cs.LG])
    For a given stable recurrent neural network (RNN) that is trained to perform a classification task using sequential inputs, we quantify explicit robustness bounds as a function of trainable weight matrices. The sequential inputs can be perturbed in various ways, e.g., streaming images can be deformed due to robot motion or imperfect camera lens. Using the notion of the Voronoi diagram and Lipschitz properties of stable RNNs, we provide a thorough analysis and characterize the maximum allowable perturbations while guaranteeing the full accuracy of the classification task. We illustrate and validate our theoretical results using a map dataset with clouds as well as the MNIST dataset.  ( 2 min )
    Semantic Norm Recognition and its application to Portuguese Law. (arXiv:2203.05425v1 [cs.CL])
    Being able to clearly interpret legal texts and fully understanding our rights, obligations and other legal norms has become progressively more important in the digital society. However, simply giving citizens access to the laws is not enough, as there is a need to provide meaningful information that cater to their specific queries and needs. For this, it is necessary to extract the relevant semantic information present in legal texts. Thus, we introduce the SNR (Semantic Norm Recognition) system, an automatic semantic information extraction system trained on a domain-specific (legal) text corpus taken from Portuguese Consumer Law. The SNR system uses the Portuguese Bert (BERTimbau) and was trained on a legislative Portuguese corpus. We demonstrate how our system achieved good results (81.44\% F1-score) on this domain-specific corpus, despite existing noise, and how it can be used to improve downstream tasks such as information retrieval.  ( 2 min )
    Asymptotic Bounds for Smoothness Parameter Estimates in Gaussian Process Regression. (arXiv:2203.05400v1 [math.ST])
    It is common to model a deterministic response function, such as the output of a computer experiment, as a Gaussian process with a Mat\'ern covariance kernel. The smoothness parameter of a Mat\'ern kernel determines many important properties of the model in the large data limit, such as the rate of convergence of the conditional mean to the response function. We prove that the maximum likelihood and cross-validation estimates of the smoothness parameter cannot asymptotically undersmooth the truth when the data are obtained on a fixed bounded subset of $\mathbb{R}^d$. That is, if the data-generating response function has Sobolev smoothness $\nu_0 + d/2$, then the smoothness parameter estimates cannot remain below $\nu_0$ as more data are obtained. These results are based on a general theorem, proved using reproducing kernel Hilbert space techniques, about sets of values the parameter estimates cannot take and approximation theory in Sobolev spaces.  ( 2 min )
    Clustering Label Inference Attack against Practical Split Learning. (arXiv:2203.05222v1 [cs.LG])
    Split learning is deemed as a promising paradigm for privacy-preserving distributed learning, where the learning model can be cut into multiple portions to be trained at the participants collaboratively. The participants only exchange the intermediate learning results at the cut layer, including smashed data via forward-pass (i.e., features extracted from the raw data) and gradients during backward-propagation.Understanding the security performance of split learning is critical for various privacy-sensitive applications.With the emphasis on private labels, this paper proposes a passive clustering label inference attack for practical split learning. The adversary (either clients or servers) can accurately retrieve the private labels by collecting the exchanged gradients and smashed data.We mathematically analyse potential label leakages in split learning and propose the cosine and Euclidean similarity measurements for clustering attack. Experimental results validate that the proposed approach is scalable and robust under different settings (e.g., cut layer positions, epochs, and batch sizes) for practical split learning.The adversary can still achieve accurate predictions, even when differential privacy and gradient compression are adopted for label protections.  ( 2 min )
    TiSAT: Time Series Anomaly Transformer. (arXiv:2203.05167v1 [cs.LG])
    While anomaly detection in time series has been an active area of research for several years, most recent approaches employ an inadequate evaluation criterion leading to an inflated F1 score. We show that a rudimentary Random Guess method can outperform state-of-the-art detectors in terms of this popular but faulty evaluation criterion. In this work, we propose a proper evaluation metric that measures the timeliness and precision of detecting sequential anomalies. Moreover, most existing approaches are unable to capture temporal features from long sequences. Self-attention based approaches, such as transformers, have been demonstrated to be particularly efficient in capturing long-range dependencies while being computationally efficient during training and inference. We also propose an efficient transformer approach for anomaly detection in time series and extensively evaluate our proposed approach on several popular benchmark datasets.  ( 2 min )
    TextConvoNet:A Convolutional Neural Network based Architecture for Text Classification. (arXiv:2203.05173v1 [cs.CL])
    In recent years, deep learning-based models have significantly improved the Natural Language Processing (NLP) tasks. Specifically, the Convolutional Neural Network (CNN), initially used for computer vision, has shown remarkable performance for text data in various NLP problems. Most of the existing CNN-based models use 1-dimensional convolving filters n-gram detectors), where each filter specialises in extracting n-grams features of a particular input word embedding. The input word embeddings, also called sentence matrix, is treated as a matrix where each row is a word vector. Thus, it allows the model to apply one-dimensional convolution and only extract n-gram based features from a sentence matrix. These features can be termed as intra-sentence n-gram features. To the extent of our knowledge, all the existing CNN models are based on the aforementioned concept. In this paper, we present a CNN-based architecture TextConvoNet that not only extracts the intra-sentence n-gram features but also captures the inter-sentence n-gram features in input text data. It uses an alternative approach for input matrix representation and applies a two-dimensional multi-scale convolutional operation on the input. To evaluate the performance of TextConvoNet, we perform an experimental study on five text classification datasets. The results are evaluated by using various performance metrics. The experimental results show that the presented TextConvoNet outperforms state-of-the-art machine learning and deep learning models for text classification purposes.
    BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis. (arXiv:2203.05297v1 [cs.CV])
    Achieving realistic, vivid, and human-like synthesized conversational gestures conditioned on multi-modal data is still an unsolved problem, due to the lack of available datasets, models and standard evaluation metrics. To address this, we build Body-Expression-Audio-Text dataset, BEAT, which has i) 76 hours, high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, ii) 32 millions frame-level emotion and semantic relevance annotations.Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Qualitative and quantitative experiments demonstrate metrics' validness, ground truth data quality, and baseline's state-of-the-art performance. To the best of our knowledge, BEAT is the largest motion capture dataset for investigating the human gestures, which may contribute to a number of different research fields including controllable gesture synthesis, cross-modality analysis, emotional gesture recognition. The data, code and model will be released for research.  ( 2 min )
    Manifold Modeling in Quotient Space: Learning An Invariant Mapping with Decodability of Image Patches. (arXiv:2203.05134v1 [cs.CV])
    This study proposes a framework for manifold learning of image patches using the concept of equivalence classes: manifold modeling in quotient space (MMQS). In MMQS, we do not consider a set of local patches of the image as it is, but rather the set of their canonical patches obtained by introducing the concept of equivalence classes and performing manifold learning on their canonical patches. Canonical patches represent equivalence classes, and their auto-encoder constructs a manifold in the quotient space. Based on this framework, we produce a novel manifold-based image model by introducing rotation-flip-equivalence relations. In addition, we formulate an image reconstruction problem by fitting the proposed image model to a corrupted observed image and derive an algorithm to solve it. Our experiments show that the proposed image model is effective for various self-supervised image reconstruction tasks, such as image inpainting, deblurring, super-resolution, and denoising.  ( 2 min )
    Assessing Phenotype Definitions for Algorithmic Fairness. (arXiv:2203.05174v1 [q-bio.OT])
    Disease identification is a core, routine activity in observational health research. Cohorts impact downstream analyses, such as how a condition is characterized, how patient risk is defined, and what treatments are studied. It is thus critical to ensure that selected cohorts are representative of all patients, independently of their demographics or social determinants of health. While there are multiple potential sources of bias when constructing phenotype definitions which may affect their fairness, it is not standard in the field of phenotyping to consider the impact of different definitions across subgroups of patients. In this paper, we propose a set of best practices to assess the fairness of phenotype definitions. We leverage established fairness metrics commonly used in predictive models and relate them to commonly used epidemiological cohort description metrics. We describe an empirical study for Crohn's disease and diabetes type 2, each with multiple phenotype definitions taken from the literature across two sets of patient subgroups (gender and race). We show that the different phenotype definitions exhibit widely varying and disparate performance according to the different fairness metrics and subgroups. We hope that the proposed best practices can help in constructing fair and inclusive phenotype definitions.  ( 2 min )
    A Tree-Structured Multi-Task Model Recommender. (arXiv:2203.05092v1 [cs.LG])
    Tree-structured multi-task architectures have been employed to jointly tackle multiple vision tasks in the context of multi-task learning (MTL). The major challenge is to determine where to branch out for each task given a backbone model to optimize for both task accuracy and computation efficiency. To address the challenge, this paper proposes a recommender that, given a set of tasks and a convolutional neural network-based backbone model, automatically suggests tree-structured multi-task architectures that could achieve a high task performance while meeting a user-specified computation budget without performing model training. Extensive evaluations on popular MTL benchmarks show that the recommended architectures could achieve competitive task accuracy and computation efficiency compared with state-of-the-art MTL methods.  ( 2 min )
    IAE-Net: Integral Autoencoders for Discretization-Invariant Learning. (arXiv:2203.05142v1 [cs.LG])
    Discretization invariant learning aims at learning in the infinite-dimensional function spaces with the capacity to process heterogeneous discrete representations of functions as inputs and/or outputs of a learning model. This paper proposes a novel deep learning framework based on integral autoencoders (IAE-Net) for discretization invariant learning. The basic building block of IAE-Net consists of an encoder and a decoder as integral transforms with data-driven kernels, and a fully connected neural network between the encoder and decoder. This basic building block is applied in parallel in a wide multi-channel structure, which are repeatedly composed to form a deep and densely connected neural network with skip connections as IAE-Net. IAE-Net is trained with randomized data augmentation that generates training data with heterogeneous structures to facilitate the performance of discretization invariant learning. The proposed IAE-Net is tested with various applications in predictive data science, solving forward and inverse problems in scientific computing, and signal/image processing. Compared with alternatives in the literature, IAE-Net achieves state-of-the-art performance in existing applications and creates a wide range of new applications.  ( 2 min )
    A Review of Open Source Software Tools for Time Series Analysis. (arXiv:2203.05195v1 [cs.LG])
    Time series data is used in a wide range of real world applications. In a variety of domains , detailed analysis of time series data (via Forecasting and Anomaly Detection) leads to a better understanding of how events associated with a specific time instance behave. Time Series Analysis (TSA) is commonly performed with plots and traditional models. Machine Learning (ML) approaches , on the other hand , have seen an increase in the state of the art for Forecasting and Anomaly Detection because they provide comparable results when time and data constraints are met. A number of time series toolboxes are available that offer rich interfaces to specific model classes (ARIMA/filters , neural networks) or framework interfaces to isolated time series modelling tasks (forecasting , feature extraction , annotation , classification). Nonetheless , open source machine learning capabilities for time series remain limited , and existing libraries are frequently incompatible with one another. The goal of this paper is to provide a concise and user friendly overview of the most important open source tools for time series analysis. This article examines two related toolboxes (1) forecasting and (2) anomaly detection. This paper describes a typical Time Series Analysis (TSA) framework with an architecture and lists the main features of TSA framework. The tools are categorized based on the criteria of analysis tasks completed , data preparation methods employed , and evaluation methods for results generated. This paper presents quantitative analysis and discusses the current state of actively developed open source Time Series Analysis frameworks. Overall , this article considered 60 time series analysis tools , and 32 of which provided forecasting modules , and 21 packages included anomaly detection.  ( 3 min )
    The Transitive Information Theory and its Application to Deep Generative Models. (arXiv:2203.05074v1 [cs.LG])
    Paradoxically, a Variational Autoencoder (VAE) could be pushed in two opposite directions, utilizing powerful decoder model for generating realistic images but collapsing the learned representation, or increasing regularization coefficient for disentangling representation but ultimately generating blurry examples. Existing methods narrow the issues to the rate-distortion trade-off between compression and reconstruction. We argue that a good reconstruction model does learn high capacity latents that encode more details, however, its use is hindered by two major issues: the prior is random noise which is completely detached from the posterior and allow no controllability in the generation; mean-field variational inference doesn't enforce hierarchy structure which makes the task of recombining those units into plausible novel output infeasible. As a result, we develop a system that learns a hierarchy of disentangled representation together with a mechanism for recombining the learned representation for generalization. This is achieved by introducing a minimal amount of inductive bias to learn controllable prior for the VAE. The idea is supported by here developed transitive information theory, that is, the mutual information between two target variables could alternately be maximized through the mutual information to the third variable, thus bypassing the rate-distortion bottleneck in VAE design. In particular, we show that our model, named SemafoVAE (inspired by the similar concept in computer science), could generate high-quality examples in a controllable manner, perform smooth traversals of the disentangled factors and intervention at a different level of representation hierarchy.  ( 2 min )
    SAGE: Generating Symbolic Goals for Myopic Models in Deep Reinforcement Learning. (arXiv:2203.05079v1 [cs.LG])
    Model-based reinforcement learning algorithms are typically more sample efficient than their model-free counterparts, especially in sparse reward problems. Unfortunately, many interesting domains are too complex to specify the complete models required by traditional model-based approaches. Learning a model takes a large number of environment samples, and may not capture critical information if the environment is hard to explore. If we could specify an incomplete model and allow the agent to learn how best to use it, we could take advantage of our partial understanding of many domains. Existing hybrid planning and learning systems which address this problem often impose highly restrictive assumptions on the sorts of models which can be used, limiting their applicability to a wide range of domains. In this work we propose SAGE, an algorithm combining learning and planning to exploit a previously unusable class of incomplete models. This combines the strengths of symbolic planning and neural learning approaches in a novel way that outperforms competing methods on variations of taxi world and Minecraft.  ( 2 min )
    Internet-augmented language models through few-shot prompting for open-domain question answering. (arXiv:2203.05115v1 [cs.CL])
    In this work, we aim to capitalize on the unique few-shot capabilities offered by large-scale language models to overcome some of their challenges with respect to grounding to factual and up-to-date information. Motivated by semi-parametric language models, which ground their decisions in external retrieved evidence, we use few-shot prompting to learn to condition language models on information returned from the web using Google Search, a broad and constantly updated knowledge source. Our approach does not involve fine-tuning or learning additional parameters, thus making it applicable to any language model, offering like this a strong baseline. Indeed, we find that language models conditioned on the web surpass performance of closed-book models of similar, or even larger, model sizes in open-domain question answering. Finally, we find that increasing the inference-time compute of models, achieved via using multiple retrieved evidences to generate multiple answers followed by a reranking stage, alleviates generally decreased performance of smaller few-shot language models. All in all, our findings suggest that it might be beneficial to slow down the race towards the biggest model and instead shift the attention towards finding more effective ways to use models, including but not limited to better prompting or increasing inference-time compute.  ( 2 min )
    Action-Constrained Reinforcement Learning for Frame-Level Bit Allocation in HEVC/H.265 through Frank-Wolfe Policy Optimization. (arXiv:2203.05127v1 [eess.IV])
    This paper presents a reinforcement learning (RL) framework that leverages Frank-Wolfe policy optimization to address frame-level bit allocation for HEVC/H.265. Most previous RL-based approaches adopt the single-critic design, which weights the rewards for distortion minimization and rate regularization by an empirically chosen hyper-parameter. More recently, the dual-critic design is proposed to update the actor network by alternating the rate and distortion critics. However, the convergence of training is not guaranteed. To address this issue, we introduce Neural Frank-Wolfe Policy Optimization (NFWPO) in formulating the frame-level bit allocation as an action-constrained RL problem. In this new framework, the rate critic serves to specify a feasible action set, and the distortion critic updates the actor network towards maximizing the reconstruction quality while conforming to the action constraint. Experimental results show that when trained to optimize the video multi-method assessment fusion (VMAF) metric, our NFWPO-based model outperforms both the single-critic and the dual-critic methods. It also demonstrates comparable rate-distortion performance to the 2-pass average bit rate control of x265.  ( 2 min )
    Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic Literature. (arXiv:2203.05112v1 [cs.DL])
    Data citations provide a foundation for studying research data impact. Collecting and managing data citations is a new frontier in archival science and scholarly communication. However, the discovery and curation of research data citations is labor intensive. Data citations that reference unique identifiers (i.e. DOIs) are readily findable; however, informal mentions made to research data are more challenging to infer. We propose a natural language processing (NLP) paradigm to support the human task of identifying informal mentions made to research datasets. The work of discovering informal data mentions is currently performed by librarians and their staff in the Inter-university Consortium for Political and Social Research (ICPSR), a large social science data archive that maintains a large bibliography of data-related literature. The NLP model is bootstrapped from data citations actively collected by librarians at ICPSR. The model combines pattern matching with multiple iterations of human annotations to learn additional rules for detecting informal data mentions. These examples are then used to train an NLP pipeline. The librarian-in-the-loop paradigm is centered in the data work performed by ICPSR librarians, supporting broader efforts to build a more comprehensive bibliography of data-related literature that reflects the scholarly communities of research data users.  ( 2 min )
    Collusion Detection in Team-Based Multiplayer Games. (arXiv:2203.05121v1 [cs.LG])
    In the context of competitive multiplayer games, collusion happens when two or more teams decide to collaborate towards a common goal, with the intention of gaining an unfair advantage from this cooperation. The task of identifying colluders from the player population is however infeasible to game designers due to the sheer size of the player population. In this paper, we propose a system that detects colluding behaviors in team-based multiplayer games and highlights the players that most likely exhibit colluding behaviors. The game designers then proceed to analyze a smaller subset of players and decide what action to take. For this reason, it is important and necessary to be extremely careful with false positives when automating the detection. The proposed method analyzes the players' social relationships paired with their in-game behavioral patterns and, using tools from graph theory, infers a feature set that allows us to detect and measure the degree of collusion exhibited by each pair of players from opposing teams. We then automate the detection using Isolation Forest, an unsupervised learning technique specialized in highlighting outliers, and show the performance and efficiency of our approach on two real datasets, each with over 170,000 unique players and over 100,000 different matches.  ( 2 min )
    On the influence of over-parameterization in manifold based surrogates and deep neural operators. (arXiv:2203.05071v1 [cs.LG])
    Constructing accurate and generalizable approximators for complex physico-chemical processes exhibiting highly non-smooth dynamics is challenging. In this work, we propose new developments and perform comparisons for two promising approaches: manifold-based polynomial chaos expansion (m-PCE) and the deep neural operator (DeepONet), and we examine the effect of over-parameterization on generalization. We demonstrate the performance of these methods in terms of generalization accuracy by solving the 2D time-dependent Brusselator reaction-diffusion system with uncertainty sources, modeling an autocatalytic chemical reaction between two species. We first propose an extension of the m-PCE by constructing a mapping between latent spaces formed by two separate embeddings of input functions and output QoIs. To enhance the accuracy of the DeepONet, we introduce weight self-adaptivity in the loss function. We demonstrate that the performance of m-PCE and DeepONet is comparable for cases of relatively smooth input-output mappings. However, when highly non-smooth dynamics is considered, DeepONet shows higher accuracy. We also find that for m-PCE, modest over-parameterization leads to better generalization, both within and outside of distribution, whereas aggressive over-parameterization leads to over-fitting. In contrast, an even highly over-parameterized DeepONet leads to better generalization for both smooth and non-smooth dynamics. Furthermore, we compare the performance of the above models with another operator learning model, the Fourier Neural Operator, and show that its over-parameterization also leads to better generalization. Our studies show that m-PCE can provide very good accuracy at very low training cost, whereas a highly over-parameterized DeepONet can provide better accuracy and robustness to noise but at higher training cost. In both methods, the inference cost is negligible.  ( 2 min )
    Conditional Synthetic Data Generation for Personal Thermal Comfort Models. (arXiv:2203.05242v1 [cs.LG])
    Personal thermal comfort models aim to predict an individual's thermal comfort response, instead of the average response of a large group. Recently, machine learning algorithms have proven to be having enormous potential as a candidate for personal thermal comfort models. But, often within the normal settings of a building, personal thermal comfort data obtained via experiments are heavily class-imbalanced. There are a disproportionately high number of data samples for the "Prefer No Change" class, as compared with the "Prefer Warmer" and "Prefer Cooler" classes. Machine learning algorithms trained on such class-imbalanced data perform sub-optimally when deployed in the real world. To develop robust machine learning-based applications using the above class-imbalanced data, as well as for privacy-preserving data sharing, we propose to implement a state-of-the-art conditional synthetic data generator to generate synthetic data corresponding to the low-frequency classes. Via experiments, we show that the synthetic data generated has a distribution that mimics the real data distribution. The proposed method can be extended for use by other smart building datasets/use-cases.  ( 2 min )
    Model-Architecture Co-Design for High Performance Temporal GNN Inference on FPGA. (arXiv:2203.05095v1 [cs.AR])
    Temporal Graph Neural Networks (TGNNs) are powerful models to capture temporal, structural, and contextual information on temporal graphs. The generated temporal node embeddings outperform other methods in many downstream tasks. Real-world applications require high performance inference on real-time streaming dynamic graphs. However, these models usually rely on complex attention mechanisms to capture relationships between temporal neighbors. In addition, maintaining vertex memory suffers from intrinsic temporal data dependency that hinders task-level parallelism, making it inefficient on general-purpose processors. In this work, we present a novel model-architecture co-design for inference in memory-based TGNNs on FPGAs. The key modeling optimizations we propose include a light-weight method to compute attention scores and a related temporal neighbor pruning strategy to further reduce computation and memory accesses. These are holistically coupled with key hardware optimizations that leverage FPGA hardware. We replace the temporal sampler with an on-chip FIFO based hardware sampler and the time encoder with a look-up-table. We train our simplified models using knowledge distillation to ensure similar accuracy vis-\'a-vis the original model. Taking advantage of the model optimizations, we propose a principled hardware architecture using batching, pipelining, and prefetching techniques to further improve the performance. We also propose a hardware mechanism to ensure the chronological vertex updating without sacrificing the computation parallelism. We evaluate the performance of the proposed hardware accelerator on three real-world datasets.  ( 2 min )
    Transfer Learning as an Essential Tool for Digital Twins in Renewable Energy Systems. (arXiv:2203.05026v1 [cs.LG])
    Transfer learning (TL), the next frontier in machine learning (ML), has gained much popularity in recent years, due to the various challenges faced in ML, like the requirement of vast amounts of training data, expensive and time-consuming labelling processes for data samples, and long training duration for models. TL is useful in tackling these problems, as it focuses on transferring knowledge from previously solved tasks to new tasks. Digital twins and other intelligent systems need to utilise TL to use the previously gained knowledge and solve new tasks in a more self-reliant way, and to incrementally increase their knowledge base. Therefore, in this article, the critical challenges in power forecasting and anomaly detection in the context of renewable energy systems are identified, and a potential TL framework to meet these challenges is proposed. This article also proposes a feature embedding approach to handle the missing sensors data. The proposed TL methods help to make a system more autonomous in the context of organic computing.  ( 2 min )
    Improving Neural ODEs via Knowledge Distillation. (arXiv:2203.05103v1 [cs.CV])
    Neural Ordinary Differential Equations (Neural ODEs) construct the continuous dynamics of hidden units using ordinary differential equations specified by a neural network, demonstrating promising results on many tasks. However, Neural ODEs still do not perform well on image recognition tasks. The possible reason is that the one-hot encoding vector commonly used in Neural ODEs can not provide enough supervised information. We propose a new training based on knowledge distillation to construct more powerful and robust Neural ODEs fitting image recognition tasks. Specially, we model the training of Neural ODEs into a teacher-student learning process, in which we propose ResNets as the teacher model to provide richer supervised information. The experimental results show that the new training manner can improve the classification accuracy of Neural ODEs by 24% on CIFAR10 and 5% on SVHN. In addition, we also quantitatively discuss the effect of both knowledge distillation and time horizon in Neural ODEs on robustness against adversarial examples. The experimental analysis concludes that introducing the knowledge distillation and increasing the time horizon can improve the robustness of Neural ODEs against adversarial examples.  ( 2 min )
    Universal Regression with Adversarial Responses. (arXiv:2203.05067v1 [cs.LG])
    We provide algorithms for regression with adversarial responses under large classes of non-i.i.d. instance sequences, on general separable metric spaces, with provably minimal assumptions. We also give characterizations of learnability in this regression context. We consider universal consistency which asks for strong consistency of a learner without restrictions on the value responses. Our analysis shows that such objective is achievable for a significantly larger class of instance sequences than stationary processes, and unveils a fundamental dichotomy between value spaces: whether finite-horizon mean-estimation is achievable or not. We further provide optimistically universal learning rules, i.e., such that if they fail to achieve universal consistency, any other algorithm will fail as well. For unbounded losses, we propose a mild integrability condition under which there exist algorithms for adversarial regression under large classes of non-i.i.d. instance sequences. In addition, our analysis also provides a learning rule for mean-estimation in general metric spaces that is consistent under adversarial responses without any moment conditions on the sequence, a result of independent interest.  ( 2 min )
    Transition to Linearity of Wide Neural Networks is an Emerging Property of Assembling Weak Models. (arXiv:2203.05104v1 [cs.LG])
    Wide neural networks with linear output layer have been shown to be near-linear, and to have near-constant neural tangent kernel (NTK), in a region containing the optimization path of gradient descent. These findings seem counter-intuitive since in general neural networks are highly complex models. Why does a linear structure emerge when the networks become wide? In this work, we provide a new perspective on this "transition to linearity" by considering a neural network as an assembly model recursively built from a set of sub-models corresponding to individual neurons. In this view, we show that the linearity of wide neural networks is, in fact, an emerging property of assembling a large number of diverse "weak" sub-models, none of which dominate the assembly.  ( 2 min )
    Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition. (arXiv:2203.05008v1 [cs.CL])
    Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather"). We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high frequency (head) sentences. Second, to encourage rare-word exposure, we explicitly filter for words rare in the acoustic data. Finally, we tackle domain-mismatch via perplexity-based contrastive selection, filtering for examples matched to the target domain. We down-select a large corpus of web search queries by a factor of 53x and achieve better LM perplexities than without down-selection. When shallow-fused with a state-of-the-art, production speech engine, our LM achieves WER reductions of up to 24% relative on rare-word sentences (without changing overall WER) compared to a baseline LM trained on the raw corpus. These gains are further validated through favorable side-by-side evaluations on live voice search traffic.  ( 2 min )
    Evaluating Proposed Fairness Models for Face Recognition Algorithms. (arXiv:2203.05051v1 [cs.CV])
    The development of face recognition algorithms by academic and commercial organizations is growing rapidly due to the onset of deep learning and the widespread availability of training data. Though tests of face recognition algorithm performance indicate yearly performance gains, error rates for many of these systems differ based on the demographic composition of the test set. These "demographic differentials" in algorithm performance can contribute to unequal or unfair outcomes for certain groups of people, raising concerns with increased worldwide adoption of face recognition systems. Consequently, regulatory bodies in both the United States and Europe have proposed new rules requiring audits of biometric systems for "discriminatory impacts" (European Union Artificial Intelligence Act) and "fairness" (U.S. Federal Trade Commission). However, no standard for measuring fairness in biometric systems yet exists. This paper characterizes two proposed measures of face recognition algorithm fairness (fairness measures) from scientists in the U.S. and Europe. We find that both proposed methods are challenging to interpret when applied to disaggregated face recognition error rates as they are commonly experienced in practice. To address this, we propose a set of interpretability criteria, termed the Functional Fairness Measure Criteria (FFMC), that outlines a set of properties desirable in a face recognition algorithm fairness measure. We further develop a new fairness measure, the Gini Aggregation Rate for Biometric Equitability (GARBE), and show how, in conjunction with the Pareto optimization, this measure can be used to select among alternative algorithms based on the accuracy/fairness trade-space. Finally, we have open-sourced our dataset of machine-readable, demographically disaggregated error rates. We believe this is currently the largest open-source dataset of its kind.  ( 2 min )
    PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks. (arXiv:2203.05126v1 [cs.LG])
    With the increasing abundance of pretrained models in recent years, the problem of selecting the best pretrained checkpoint for a particular downstream classification task has been gaining increased attention. Although several methods have recently been proposed to tackle the selection problem (e.g. LEEP, H-score), these methods resort to applying heuristics that are not well motivated by learning theory. In this paper we present PACTran, a theoretically grounded family of metrics for pretrained model selection and transferability measurement. We first show how to derive PACTran metrics from the optimal PAC-Bayesian bound under the transfer learning setting. We then empirically evaluate three metric instantiations of PACTran on a number of vision tasks (VTAB) as well as a language-and-vision (OKVQA) task. An analysis of the results shows PACTran is a more consistent and effective transferability measure compared to existing selection methods.  ( 2 min )
    Multi-Task Adversarial Learning for Treatment Effect Estimation in Basket Trials. (arXiv:2203.05123v1 [cs.LG])
    Estimating treatment effects from observational data provides insights about causality guiding many real-world applications such as different clinical study designs, which are the formulations of trials, experiments, and observational studies in medical, clinical, and other types of research. In this paper, we describe causal inference for application in a novel clinical design called basket trial that tests how well a new drug works in patients who have different types of cancer that all have the same mutation. We propose a multi-task adversarial learning (MTAL) method, which incorporates feature selection multi-task representation learning and adversarial learning to estimate potential outcomes across different tumor types for patients sharing the same genetic mutation but having different tumor types. In our paper, the basket trial is employed as an intuitive example to present this new causal inference setting. This new causal inference setting includes, but is not limited to basket trials. This setting has the same challenges as the traditional causal inference problem, i.e., missing counterfactual outcomes under different subgroups and treatment selection bias due to confounders. We present the practical advantages of our MTAL method for the analysis of synthetic basket trial data and evaluate the proposed estimator on two benchmarks, IHDP and News. The results demonstrate the superiority of our MTAL method over the competing state-of-the-art methods.  ( 2 min )
    Detecting and Diagnosing Terrestrial Gravitational-Wave Mimics Through Feature Learning. (arXiv:2203.05086v1 [astro-ph.IM])
    As engineered systems grow in complexity, there is an increasing need for automatic methods that can detect, diagnose, and even correct transient anomalies that inevitably arise and can be difficult or impossible to diagnose and fix manually. Among the most sensitive and complex systems of our civilization are the detectors that search for incredibly small variations in distance caused by gravitational waves -- phenomena originally predicted by Albert Einstein to emerge and propagate through the universe as the result of collisions between black holes and other massive objects in deep space. The extreme complexity and precision of such detectors causes them to be subject to transient noise issues that can significantly limit their sensitivity and effectiveness. In this work, we present a demonstration of a method that can detect and characterize emergent transient anomalies of such massively complex systems. We illustrate the performance, precision, and adaptability of the automated solution via one of the prevalent issues limiting gravitational-wave discoveries: noise artifacts of terrestrial origin that contaminate gravitational wave observatories' highly sensitive measurements and can obscure or even mimic the faint astrophysical signals for which they are listening. Specifically, we demonstrate how a highly interpretable convolutional classifier can automatically learn to detect transient anomalies from auxiliary detector data without needing to observe the anomalies themselves. We also illustrate several other useful features of the model, including how it performs automatic variable selection to reduce tens of thousands of auxiliary data channels to only a few relevant ones; how it identifies behavioral signatures predictive of anomalies in those channels; and how it can be used to investigate individual anomalies and the channels associated with them.  ( 2 min )
    Connecting sufficient conditions for domain adaptation: source-guided uncertainty, relaxed divergences and discrepancy localization. (arXiv:2203.05076v1 [cs.LG])
    Recent advances in domain adaptation establish that requiring a low risk on the source domain and equal feature marginals degrade the adaptation's performance. At the same time, empirical evidence shows that incorporating an unsupervised target domain term that pushes decision boundaries away from the high-density regions, along with relaxed alignment, improves adaptation. In this paper, we theoretically justify such observations via a new bound on the target risk, and we connect two notions of relaxation for divergence, namely $\beta-$relaxed divergences and localization. This connection allows us to incorporate the source domain's categorical structure into the relaxation of the considered divergence, provably resulting in a better handling of the label shift case in particular.  ( 2 min )
    Optimal Methods for Risk Averse Distributed Optimization. (arXiv:2203.05117v1 [math.OC])
    This paper studies the communication complexity of risk averse optimization over a network. The problem generalizes the well-studied risk-neutral finite-sum distributed optimization problem and its importance stems from the need to handle risk in an uncertain environment. For algorithms in the literature, there exists a gap in communication complexities for solving risk-averse and risk-neutral problems. We propose two distributed algorithms, namely the distributed risk averse optimization (DRAO) method and the distributed risk averse optimization with sliding (DRAO-S) method, to close the gap. Specifically, the DRAO method achieves the optimal communication complexity by assuming a certain saddle point subproblem can be easily solved in the server node. The DRAO-S method removes the strong assumption by introducing a novel saddle point sliding subroutine which only requires the projection over the ambiguity set $P$. We observe that the number of $P$-projections performed by DRAO-S is optimal. Moreover, we develop matching lower complexity bounds to show that communication complexities of both DRAO and DRAO-S are not improvable. Numerical experiments are conducted to demonstrate the encouraging empirical performance of the DRAO-S method.  ( 2 min )
    Resource-Efficient Invariant Networks: Exponential Gains by Unrolled Optimization. (arXiv:2203.05006v1 [cs.CV])
    Achieving invariance to nuisance transformations is a fundamental challenge in the construction of robust and reliable vision systems. Existing approaches to invariance scale exponentially with the dimension of the family of transformations, making them unable to cope with natural variabilities in visual data such as changes in pose and perspective. We identify a common limitation of these approaches--they rely on sampling to traverse the high-dimensional space of transformations--and propose a new computational primitive for building invariant networks based instead on optimization, which in many scenarios provides a provably more efficient method for high-dimensional exploration than sampling. We provide empirical and theoretical corroboration of the efficiency gains and soundness of our proposed method, and demonstrate its utility in constructing an efficient invariant network for a simple hierarchical object detection task when combined with unrolled optimization. Code for our networks and experiments is available at https://github.com/sdbuch/refine.  ( 2 min )
    Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks. (arXiv:2203.05025v1 [cs.LG])
    Deploying Deep Neural Networks in low-power embedded devices for real time-constrained applications requires optimization of memory and computational complexity of the networks, usually by quantizing the weights. Most of the existing works employ linear quantization which causes considerable degradation in accuracy for weight bit widths lower than 8. Since the distribution of weights is usually non-uniform (with most weights concentrated around zero), other methods, such as logarithmic quantization, are more suitable as they are able to preserve the shape of the weight distribution more precise. Moreover, using base-2 logarithmic representation allows optimizing the multiplication by replacing it with bit shifting. In this paper, we explore non-linear quantization techniques for exploiting lower bit precision and identify favorable hardware implementation options. We developed the Quantization Aware Training (QAT) algorithm that allowed training of low bit width Power-of-Two (PoT) networks and achieved accuracies on par with state-of-the-art floating point models for different tasks. We explored PoT weight encoding techniques and investigated hardware designs of MAC units for three different quantization schemes - uniform, PoT and Additive-PoT (APoT) - to show the increased efficiency when using the proposed approach. Eventually, the experiments showed that for low bit width precision, non-uniform quantization performs better than uniform, and at the same time, PoT quantization vastly reduces the computational complexity of the neural network.  ( 2 min )
    Sharing Generative Models Instead of Private Data: A Simulation Study on Mammography Patch Classification. (arXiv:2203.04961v1 [eess.IV])
    Early detection of breast cancer in mammography screening via deep-learning based computer-aided detection systems shows promising potential in improving the curability and mortality rates of breast cancer. However, many clinical centres are restricted in the amount and heterogeneity of available data to train such models to (i) achieve promising performance and to (ii) generalise well across acquisition protocols and domains. As sharing data between centres is restricted due to patient privacy concerns, we propose a potential solution: sharing trained generative models between centres as substitute for real patient data. In this work, we use three well known mammography datasets to simulate three different centres, where one centre receives the trained generator of Generative Adversarial Networks (GANs) from the two remaining centres in order to augment the size and heterogeneity of its training dataset. We evaluate the utility of this approach on mammography patch classification on the test set of the GAN-receiving centre using two different classification models, (a) a convolutional neural network and (b) a transformer neural network. Our experiments demonstrate that shared GANs notably increase the performance of both transformer and convolutional classification models and highlight this approach as a viable alternative to inter-centre data sharing.  ( 2 min )
    Learning to control from expert demonstrations. (arXiv:2203.05012v1 [eess.SY])
    In this paper, we revisit the problem of learning a stabilizing controller from a finite number of demonstrations by an expert. By first focusing on feedback linearizable systems, we show how to combine expert demonstrations into a stabilizing controller, provided that demonstrations are sufficiently long and there are at least $n+1$ of them, where $n$ is the number of states of the system being controlled. When we have more than $n+1$ demonstrations, we discuss how to optimally choose the best $n+1$ demonstrations to construct the stabilizing controller. We then extend these results to a class of systems that can be embedded into a higher-dimensional system containing a chain of integrators. The feasibility of the proposed algorithm is demonstrated by applying it on a CrazyFlie 2.0 quadrotor.  ( 2 min )
  • Open

    TensorPlan and the Few Actions Lower Bound for Planning in MDPs under Linear Realizability of Optimal Value Functions. (arXiv:2110.02195v2 [cs.LG] UPDATED)
    We consider the minimax query complexity of online planning with a generative model in fixed-horizon Markov decision processes (MDPs) with linear function approximation. Following recent works, we consider broad classes of problems where either (i) the optimal value function $v^\star$ or (ii) the optimal action-value function $q^\star$ lie in the linear span of some features; or (iii) both $v^\star$ and $q^\star$ lie in the linear span when restricted to the states reachable from the starting state. Recently, Weisz et al. (2021b) showed that under (ii) the minimax query complexity of any planning algorithm is at least exponential in the horizon $H$ or in the feature dimension $d$ when the size $A$ of the action set can be chosen to be exponential in $\min(d,H)$. On the other hand, for the setting (i), Weisz et al. (2021a) introduced TensorPlan, a planner whose query cost is polynomial in all relevant quantities when the number of actions is fixed. Among other things, these two works left open the question whether polynomial query complexity is possible when $A$ is subexponential in $\min(d,H)$. In this paper we answer this question in the negative: we show that an exponentially large lower bound holds when $A=\Omega(\min(d^{1/4},H^{1/2}))$, under either (i), (ii) or (iii). In particular, this implies a perhaps surprising exponential separation of query complexity compared to the work of Du et al. (2021) who prove a polynomial upper bound when (iii) holds for all states. Furthermore, we show that the upper bound of TensorPlan can be extended to hold under (iii) and, for MDPs with deterministic transitions and stochastic rewards, also under (ii).  ( 3 min )
    To Impute or not to Impute? Missing Data in Treatment Effect Estimation. (arXiv:2202.02096v2 [stat.ML] UPDATED)
    Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population in distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.  ( 2 min )
    Filament Plots for Data Visualization. (arXiv:2107.10869v3 [cs.HC] UPDATED)
    The efficiency of modern computer graphics allows us to explore collections of space curves simultaneously with "drag-to-rotate" interfaces. This inspires us to replace "scatterplots of points" with "scatterplots of curves" to simultaneously visualize relationships across an entire dataset. Since spaces of curves are infinite dimensional, scatterplots of curves avoid the "lossy" nature of scatterplots of points. In particular, if two points are close in a scatterplot of points derived from high-dimensional data, it does not generally follow that the two associated data points are close in the data space. Standard Andrews plots provide scatterplots of curves that perfectly preserve Euclidean distances, but simultaneous visualization of these graphs over an entire dataset produces visual clutter because graphs of functions generally overlap in 2D. We mitigate this visual clutter issue by constructing computationally inexpensive 3D extensions of Andrews plots. First, we construct optimally smooth 3D Andrews plots by considering linear isometries from Euclidean data spaces to spaces of planar parametric curves. We rigorously parametrize the linear isometries that produce (on average) optimally smooth curves over a given dataset. This parameterization of optimal isometries reveals many degrees of freedom, and (using recent results on generalized Gauss sums) we identify a particular member of this set which admits an asymptotic "tour" property that avoids certain local degeneracies as well. Finally, we construct unit-length 3D curves (filaments) by numerically solving Frenet-Serret systems given data from these 3D Andrews plots. We conclude with examples of filament plots for several standard datasets, illustrating how filament plots avoid visual clutter. Code and examples available at https://github.com/n8epi/filaments/ and https://n8epi.github.io/filaments/  ( 2 min )
    Sparse Multi-Reference Alignment : Phase Retrieval, Uniform Uncertainty Principles and the Beltway Problem. (arXiv:2106.12996v3 [math.ST] UPDATED)
    Motivated by cutting-edge applications like cryo-electron microscopy (cryo-EM), the Multi-Reference Alignment (MRA) model entails the learning of an unknown signal from repeated measurements of its images under the latent action of a group of isometries and additive noise of magnitude $\sigma$. Despite significant interest, a clear picture for understanding rates of estimation in this model has emerged only recently, particularly in the high-noise regime $\sigma \gg 1$ that is highly relevant in applications. Recent investigations have revealed a remarkable asymptotic sample complexity of order $\sigma^6$ for certain signals whose Fourier transforms have full support, in stark contrast to the traditional $\sigma^2$ that arise in regular models. Often prohibitively large in practice, these results have prompted the investigation of variations around the MRA model where better sample complexity may be achieved. In this paper, we show that sparse signals exhibit an intermediate $\sigma^4$ sample complexity even in the classical MRA model. Further, we characterise the dependence of the estimation rate on the support size $s$ as $O_p(1)$ and $O_p(s^{3.5})$ in the dilute and moderate regimes of sparsity respectively. Our techniques have implications for the problem of crystallographic phase retrieval, indicating a certain local uniqueness for the recovery of sparse signals from their power spectrum. Our results explore and exploit connections of the MRA estimation problem with two classical topics in applied mathematics: the beltway problem from combinatorial optimization, and uniform uncertainty principles from harmonic analysis. Our techniques include a certain enhanced form of the probabilistic method, which might be of general interest in its own right.  ( 2 min )
    A Deep Variational Approach to Clustering Survival Data. (arXiv:2106.05763v3 [cs.LG] UPDATED)
    In this work, we study the problem of clustering survival data $-$ a challenging and so far under-explored task. We introduce a novel semi-supervised probabilistic approach to cluster survival data by leveraging recent advances in stochastic gradient variational inference. In contrast to previous work, our proposed method employs a deep generative model to uncover the underlying distribution of both the explanatory variables and censored survival times. We compare our model to the related work on clustering and mixture models for survival data in comprehensive experiments on a wide range of synthetic, semi-synthetic, and real-world datasets, including medical imaging data. Our method performs better at identifying clusters and is competitive at predicting survival times. Relying on novel generative assumptions, the proposed model offers a holistic perspective on clustering survival data and holds a promise of discovering subpopulations whose survival is regulated by different generative mechanisms.  ( 2 min )
    Supervised ML Solution for Band Assignment in Dual-Band Systems with Omnidirectional and Directional Antennas. (arXiv:1902.10890v2 [cs.LG] UPDATED)
    Many wireless networks, including 5G NR (New Radio) and future beyond 5G cellular systems, are expected to operate on multiple frequency bands. This paper considers the band assignment (BA) problem in dual-band systems, where the basestation (BS) chooses one of the two available frequency bands (centimeter-wave and millimeter-wave bands) to communicate with the user equipment (UE). While the millimeter-wave band might offer higher data rate, there is a significant probability of outage during which the communication should be carried on the (more reliable) centimeter-wave band. With mobility, the BA can be perceived as a sequential problem, where the BS uses previously observed information to predict the best band for a future time step. We formulate the BA as a binary classification problem and propose supervised Machine Learning (ML) solutions. We study the problem when both the BS and the UE use (i) omnidirectional antennas and (ii) both use directional antennas. In the omnidirectional case, we derive analytical benchmark solutions based on the Gaussian Process (GP) assumption for the inter-band shadow fading. In the directional case, where the labeling is shown to be complex, we propose an efficient labeling approach based on the Viterbi Algorithm (VA). We compare the performances for two channel models: (i) a stochastic channel and (ii) a ray-tracing based channel.  ( 2 min )
    Learning from Noisy Labels with Deep Neural Networks: A Survey. (arXiv:2007.08199v7 [cs.LG] UPDATED)
    Deep learning has achieved remarkable success in numerous domains with help from large amounts of big data. However, the quality of data labels is a concern because of the lack of high-quality labels in many real-world scenarios. As noisy labels severely degrade the generalization performance of deep neural networks, learning from noisy labels (robust training) is becoming an important task in modern deep learning applications. In this survey, we first describe the problem of learning with label noise from a supervised learning perspective. Next, we provide a comprehensive review of 62 state-of-the-art robust training methods, all of which are categorized into five groups according to their methodological difference, followed by a systematic comparison of six properties used to evaluate their superiority. Subsequently, we perform an in-depth analysis of noise rate estimation and summarize the typically used evaluation methodology, including public noisy datasets and evaluation metrics. Finally, we present several promising research directions that can serve as a guideline for future studies. All the contents will be available at https://github.com/songhwanjun/Awesome-Noisy-Labels.  ( 2 min )
    deepregression: a Flexible Neural Network Framework for Semi-Structured Deep Distributional Regression. (arXiv:2104.02705v3 [stat.ML] UPDATED)
    In this paper we describe the implementation of semi-structured deep distributional regression, a flexible framework to learn conditional distributions based on the combination of additive regression models and deep networks. Our implementation encompasses (1) a modular neural network building system based on the deep learning library \pkg{TensorFlow} for the fusion of various statistical and deep learning approaches, (2) an orthogonalization cell to allow for an interpretable combination of different subnetworks, as well as (3) pre-processing steps necessary to set up such models. The software package allows to define models in a user-friendly manner via a formula interface that is inspired by classical statistical model frameworks such as \pkg{mgcv}. The packages' modular design and functionality provides a unique resource for both scalable estimation of complex statistical models and the combination of approaches from deep learning and statistics. This allows for state-of-the-art predictive performance while simultaneously retaining the indispensable interpretability of classical statistical models.  ( 2 min )
    A Locally Adaptive Interpretable Regression. (arXiv:2005.03350v3 [stat.ML] UPDATED)
    Machine learning models with both good predictability and high interpretability are crucial for decision support systems. Linear regression is one of the most interpretable prediction models. However, the linearity in a simple linear regression worsens its predictability. In this work, we introduce a locally adaptive interpretable regression (LoAIR). In LoAIR, a metamodel parameterized by neural networks predicts percentile of a Gaussian distribution for the regression coefficients for a rapid adaptation. Our experimental results on public benchmark datasets show that our model not only achieves comparable or better predictive performance than the other state-of-the-art baselines but also discovers some interesting relationships between input and target variables such as a parabolic relationship between CO2 emissions and Gross National Product (GNP). Therefore, LoAIR is a step towards bridging the gap between econometrics, statistics, and machine learning by improving the predictive ability of linear regression without depreciating its interpretability.  ( 2 min )
    Differentially Private Learning Needs Hidden State (Or Much Faster Convergence). (arXiv:2203.05363v1 [stat.ML])
    Differential privacy analysis of randomized learning algorithms typically relies on composition theorems, where the implicit assumption is that the internal state of the iterative algorithm is revealed to the adversary. However, by assuming hidden states for DP algorithms (when only the last-iterate is observable), recent works prove a converging privacy bound for noisy gradient descent (on strongly convex smooth loss function) that is significantly smaller than composition bounds after $O(1/\text{step-size})$ epochs. In this paper, we extend this hidden-state analysis to the noisy mini-batch stochastic gradient descent algorithms on strongly-convex smooth loss functions. We prove converging R\'enyi DP bounds under various mini-batch sampling schemes, such as "shuffle and partition" (which are used in practical implementations of DP-SGD) and "sampling without replacement". We prove that, in these settings, our privacy bound is much smaller than the composition bound for training with a large number of iterations (which is the case for learning from high-dimensional data). Our converging privacy analysis, thus, shows that differentially private learning, with a tight bound, needs hidden state privacy analysis or a fast convergence. To complement our theoretical results, we run experiment on training classification models on MNIST, FMNIST and CIFAR-10 datasets, and observe a better accuracy given fixed privacy budgets, under the hidden-state analysis.  ( 2 min )
    Universal Regression with Adversarial Responses. (arXiv:2203.05067v1 [cs.LG])
    We provide algorithms for regression with adversarial responses under large classes of non-i.i.d. instance sequences, on general separable metric spaces, with provably minimal assumptions. We also give characterizations of learnability in this regression context. We consider universal consistency which asks for strong consistency of a learner without restrictions on the value responses. Our analysis shows that such objective is achievable for a significantly larger class of instance sequences than stationary processes, and unveils a fundamental dichotomy between value spaces: whether finite-horizon mean-estimation is achievable or not. We further provide optimistically universal learning rules, i.e., such that if they fail to achieve universal consistency, any other algorithm will fail as well. For unbounded losses, we propose a mild integrability condition under which there exist algorithms for adversarial regression under large classes of non-i.i.d. instance sequences. In addition, our analysis also provides a learning rule for mean-estimation in general metric spaces that is consistent under adversarial responses without any moment conditions on the sequence, a result of independent interest.  ( 2 min )
    Asymptotic Bounds for Smoothness Parameter Estimates in Gaussian Process Regression. (arXiv:2203.05400v1 [math.ST])
    It is common to model a deterministic response function, such as the output of a computer experiment, as a Gaussian process with a Mat\'ern covariance kernel. The smoothness parameter of a Mat\'ern kernel determines many important properties of the model in the large data limit, such as the rate of convergence of the conditional mean to the response function. We prove that the maximum likelihood and cross-validation estimates of the smoothness parameter cannot asymptotically undersmooth the truth when the data are obtained on a fixed bounded subset of $\mathbb{R}^d$. That is, if the data-generating response function has Sobolev smoothness $\nu_0 + d/2$, then the smoothness parameter estimates cannot remain below $\nu_0$ as more data are obtained. These results are based on a general theorem, proved using reproducing kernel Hilbert space techniques, about sets of values the parameter estimates cannot take and approximation theory in Sobolev spaces.  ( 2 min )
    A General Pairwise Comparison Model for Extremely Sparse Networks. (arXiv:2002.08853v3 [stat.ML] UPDATED)
    Statistical inference using pairwise comparison data is an effective approach to analyzing large-scale sparse networks. In this paper, we propose a general framework to model the mutual interactions in a network, which enjoys ample flexibility in terms of model parametrization. Under this setup, we show that the maximum likelihood estimator for the latent score vector of the subjects is uniformly consistent under a near-minimal condition on network sparsity. This condition is sharp in terms of the leading order asymptotics describing the sparsity. Our analysis utilizes a novel chaining technique and illustrates an important connection between graph topology and model consistency. Our results guarantee that the maximum likelihood estimator is justified for estimation in large-scale pairwise comparison networks where data are asymptotically deficient. Simulation studies are provided in support of our theoretical findings.  ( 2 min )
    TiSAT: Time Series Anomaly Transformer. (arXiv:2203.05167v1 [cs.LG])
    While anomaly detection in time series has been an active area of research for several years, most recent approaches employ an inadequate evaluation criterion leading to an inflated F1 score. We show that a rudimentary Random Guess method can outperform state-of-the-art detectors in terms of this popular but faulty evaluation criterion. In this work, we propose a proper evaluation metric that measures the timeliness and precision of detecting sequential anomalies. Moreover, most existing approaches are unable to capture temporal features from long sequences. Self-attention based approaches, such as transformers, have been demonstrated to be particularly efficient in capturing long-range dependencies while being computationally efficient during training and inference. We also propose an efficient transformer approach for anomaly detection in time series and extensively evaluate our proposed approach on several popular benchmark datasets.  ( 2 min )
    Bias-variance decomposition of overparameterized regression with random linear features. (arXiv:2203.05443v1 [stat.ML])
    In classical statistics, the bias-variance trade-off describes how varying a model's complexity (e.g., number of fit parameters) affects its ability to make accurate predictions. According to this trade-off, optimal performance is achieved when a model is expressive enough to capture trends in the data, yet not so complex that it overfits idiosyncratic features of the training data. Recently, it has become clear that this classic understanding of the bias-variance must be fundamentally revisited in light of the incredible predictive performance of "overparameterized models" -- models that avoid overfitting even when the number of fit parameters is large enough to perfectly fit the training data. Here, we present results for one of the simplest examples of an overparameterized model: regression with random linear features (i.e. a two-layer neural network with a linear activation function). Using the zero-temperature cavity method, we derive analytic expressions for the training error, test error, bias, and variance. We show that the linear random features model exhibits three phase transitions: two different transitions to an interpolation regime where the training error is zero, along with an additional transition between regimes with large bias and minimal bias. Using random matrix theory, we show how each transition arises due to small nonzero eigenvalues in the Hessian matrix. Finally, we compare and contrast the phase diagram of the random linear features model to the random nonlinear features model and ordinary regression, highlighting the new phase transitions that result from the use of linear basis functions.  ( 2 min )
    Discrepancy Modeling Framework: Learning missing physics, modeling systematic residuals, and disambiguating between deterministic and random effects. (arXiv:2203.05164v1 [stat.ML])
    Physics-based and first-principles models pervade the engineering and physical sciences, allowing for the ability to model the dynamics of complex systems with a prescribed accuracy. The approximations used in deriving governing equations often result in discrepancies between the model and sensor-based measurements of the system, revealing the approximate nature of the equations and/or the signal-to-noise ratio of the sensor itself. In modern dynamical systems, such discrepancies between model and measurement can lead to poor quantification, often undermining the ability to produce accurate and precise control algorithms. We introduce a discrepancy modeling framework to resolve deterministic model-measurement mismatch with two distinct approaches: (i) by learning a model for the evolution of systematic state-space residual, and (ii) by discovering a model for the missing deterministic physics. Regardless of approach, a common suite of data-driven model discovery methods can be used. Specifically, we use four fundamentally different methods to demonstrate the mathematical implementations of discrepancy modeling: (i) the sparse identification of nonlinear dynamics (SINDy), (ii) dynamic mode decomposition (DMD), (iii) Gaussian process regression (GPR), and (iv) neural networks (NN). The choice of method depends on one's intent for discrepancy modeling, as well as quantity and quality of the sensor measurements. We demonstrate the utility and suitability for both discrepancy modeling approaches using the suite of data-driven modeling methods on three dynamical systems under varying signal-to-noise ratios. We compare reconstruction and forecasting accuracies and provide detailed comparatives, allowing one to select the appropriate approach and method in practice.  ( 2 min )
    A low-rank ensemble Kalman filter for elliptic observations. (arXiv:2203.05120v1 [physics.flu-dyn])
    We propose a regularization method for ensemble Kalman filtering (EnKF) with elliptic observation operators. Commonly used EnKF regularization methods suppress state correlations at long distances. For observations described by elliptic partial differential equations, such as the pressure Poisson equation (PPE) in incompressible fluid flows, distance localization cannot be applied, as we cannot disentangle slowly decaying physical interactions from spurious long-range correlations. This is particularly true for the PPE, in which distant vortex elements couple nonlinearly to induce pressure. Instead, these inverse problems have a low effective dimension: low-dimensional projections of the observations strongly inform a low-dimensional subspace of the state space. We derive a low-rank factorization of the Kalman gain based on the spectrum of the Jacobian of the observation operator. The identified eigenvectors generalize the source and target modes of the multipole expansion, independently of the underlying spatial distribution of the problem. Given rapid spectral decay, inference can be performed in the low-dimensional subspace spanned by the dominant eigenvectors. This low-rank EnKF is assessed on dynamical systems with Poisson observation operators, where we seek to estimate the positions and strengths of point singularities over time from potential or pressure observations. We also comment on the broader applicability of this approach to elliptic inverse problems outside the context of filtering.  ( 2 min )
    Deep Regression Ensembles. (arXiv:2203.05417v1 [stat.ML])
    We introduce a methodology for designing and training deep neural networks (DNN) that we call "Deep Regression Ensembles" (DRE). It bridges the gap between DNN and two-layer neural networks trained with random feature regression. Each layer of DRE has two components, randomly drawn input weights and output weights trained myopically (as if the final output layer) using linear ridge regression. Within a layer, each neuron uses a different subset of inputs and a different ridge penalty, constituting an ensemble of random feature ridge regressions. Our experiments show that a single DRE architecture is at par with or exceeds state-of-the-art DNN in many data sets. Yet, because DRE neural weights are either known in closed-form or randomly drawn, its computational cost is orders of magnitude smaller than DNN.  ( 2 min )
    Fully Adaptive Composition in Differential Privacy. (arXiv:2203.05481v1 [cs.LG])
    Composition is a key feature of differential privacy. Well-known advanced composition theorems allow one to query a private database quadratically more times than basic privacy composition would permit. However, these results require that the privacy parameters of all algorithms be fixed before interacting with the data. To address this, Rogers et al. introduced fully adaptive composition, wherein both algorithms and their privacy parameters can be selected adaptively. The authors introduce two probabilistic objects to measure privacy in adaptive composition: privacy filters, which provide differential privacy guarantees for composed interactions, and privacy odometers, time-uniform bounds on privacy loss. There are substantial gaps between advanced composition and existing filters and odometers. First, existing filters place stronger assumptions on the algorithms being composed. Second, these odometers and filters suffer from large constants, making them impractical. We construct filters that match the tightness of advanced composition, including constants, despite allowing for adaptively chosen privacy parameters. We also construct several general families of odometers. These odometers can match the tightness of advanced composition at an arbitrary, preselected point in time, or at all points in time simultaneously, up to a doubly-logarithmic factor. We obtain our results by leveraging recent advances in time-uniform martingale concentration. In sum, we show that fully adaptive privacy is obtainable at almost no loss, and conjecture that our results are essentially unimprovable (even in constants) in general.  ( 2 min )
    Spatially-Varying Bayesian Predictive Synthesis for Flexible and Interpretable Spatial Prediction. (arXiv:2203.05197v1 [stat.ME])
    Spatial data are characterized by their spatial dependence, which is often complex, non-linear, and difficult to capture with a single model. Significant levels of model uncertainty -- arising from these characteristics -- cannot be resolved by model selection or simple ensemble methods, as performances are not homogeneous. We address this issue by proposing a novel methodology that captures spatially-varying model uncertainty, which we call spatial Bayesian predictive synthesis. Our proposal is defined by specifying a latent factor spatially-varying coefficient model as the synthesis function, which enables model coefficients to vary over the region to achieve flexible spatial model ensembling. Two MCMC strategies are implemented for full uncertainty quantification, as well as a variational inference strategy for fast point inference. We also extend the estimations strategy for general responses. A finite sample theoretical guarantee is given for the predictive performance of our methodology, showing that the predictions are exact minimax. Through simulation examples and two real data applications, we demonstrate that our proposed spatial Bayesian predictive synthesis outperforms standard spatial models and advanced machine learning methods, in terms of predictive accuracy, while maintaining interpretability of the prediction mechanism.  ( 2 min )
    Neural network approximation and estimation of classifiers with classification boundary in a Barron class. (arXiv:2011.09363v2 [math.FA] UPDATED)
    We prove bounds for the approximation and estimation of certain binary classification functions using ReLU neural networks. Our estimation bounds provide a priori performance guarantees for empirical risk minimization using networks of a suitable size, depending on the number of training samples available. The obtained approximation and estimation rates are independent of the dimension of the input, showing that the curse of dimensionality can be overcome in this setting; in fact, the input dimension only enters in the form of a polynomial factor. Regarding the regularity of the target classification function, we assume the interfaces between the different classes to be locally of Barron-type. We complement our results by studying the relations between various Barron-type spaces that have been proposed in the literature. These spaces differ substantially more from each other than the current literature suggests.  ( 2 min )

  • Open

    [D] How to extract specific words from text
    Is there any pretrained model that can say pull out a sport from a string such as the string "I want to go play soccer" and have it get soccer? Obviously I could do this without a model but I want it to be able to work for other things as well. Thank you! submitted by /u/Mesablip [link] [comments]  ( 1 min )
    [D] what is the most useful ML application for geospatial industry?
    what is the most useful ML application for geospatial industry? submitted by /u/SamanthaLong [link] [comments]
    [D] Model that can parse data the same way training data has
    Hi everyone. I'm looking for either a pretrained model or some kind of architecture that can parse a tweet based on how the training data looks. For example, if the training data includes a tweet from an esports team saying: "Team A beats Team B 4 - 0 in the Overwatch finals" and has columns for team 1 name, team 2 name, score, game name, etc. (all pulled from the tweet), how can I train a model to parse up the tweet into these columns for me? Thank you! submitted by /u/Mesablip [link] [comments]  ( 1 min )
    [P] Multilabel classifier substantially underperforms individual binary classifiers
    Hi, I originally trained multiple individual binary classifiers for each label of an image. Then, I realized I can train a single multilabel model for this task. I used binary_cross_entropy loss for this instead of categorical_cross_entropy, but besides for changing the loss function I made no major changes. However, I find that the multilabel classifier still substantially underperforms the individual label classifiers. Is this common and to be expected? Are there any tricks I am missing? Thanks! submitted by /u/rsandler [link] [comments]  ( 1 min )
    [D] Can neural networks be viewed as dynamical systems?
    Can one compare neural networks and dynamical systems? If so, is such comparison useful? For instance, can something like a space defined by activations/values of every single parameter in a NN be thought of as the phase space? Is there any interesting literature exploring that idea? submitted by /u/sarfins [link] [comments]  ( 1 min )
    [D] Machine Learning biggest unsolved problems
    What are the biggest unsolved theoretical and applied problems in machine learning and, particularly, in deep learning? submitted by /u/SamanthaLong [link] [comments]  ( 1 min )
    [N] Advanced machine learning and deep learning tutorial
    Dears. I have more than 5 years experience in machine learning and deep learning and recently have created a Youtube channel. https://www.youtube.com/channel/UCn9Rujwh7SfHF2RRvy_ks-g In the channel, I first explain a paper, then I implement/explain the code . Please join, leave a comment, and share with your friends. You can also suggest any paper and I will add it to my list. I am constantly trying to improve contents and quality. Thanks. submitted by /u/MRMohebian [link] [comments]  ( 1 min )
    Why not use alternating minimization for training neural networks? [R]
    Neural networks are usually trained by calculating gradients using backprop and then performing gradient descent. I am not sure what are difficulties in training a neural network using alternating minimization? It seems that the main idea in neural networks is that there are many parameters, and updates can be performed in parallel, thus speeding up the process. In this process, we lose the theoretical guarantees of alternating minimization, but we gain on speed. Is that right? What is the difference from a theoretical and optimization point of view of updating everything at once and serially? Alternating minimization- In this optimization strategy, the idea is to fix everything except one variable and perform gradient descent for it, and do this procedure alternatively for all variables till some convergence criteria is met. Reference: https://www.mit.edu/~rakhlin/6.883/lectures/lecture07.pdf Also, what is the order of computing gradient using backprop vs. naive gradient computation in neural networks? submitted by /u/dushyantsahoo [link] [comments]  ( 1 min )
    [N] PyTorch 1.11 released with tf.data and JAX inspired libraries TorchData and functorch
    https://pytorch.org/blog/pytorch-1.11-released/ As a longtime TensorFlow user I've been meaning to switch to either JAX or PyTorch, thus I'm pretty intrigued by this. In the past I've been having a hard time giving up tf.data's pretty elegant fluent interface for performant I/O and data preprocessing. Has anyone tried the new PyTorch equivalent? How does TorchData stack up? And are there more things in JAX that functorch cannot express or will both autograd engines hit feature parity now-ish? submitted by /u/carlthome [link] [comments]  ( 1 min )
    [D] Do GANs learn manifolds?
    Hello, I'm getting started with some literature on GANs and I'm wondering what kind of latent structure do they learn. For example how does the latent space differ from that of VAEs or MAEs? (masked auto encoders) Do they learn manifolds? What's the difference? Also in the context of styleGAN. submitted by /u/jim_from_truckistan [link] [comments]  ( 1 min )
    [D] ICML 2022 notification of phase 1 rejection
    Today is supposed to be phase 1 rejection notification. For papers that will move to phase 2, will there be a notification, or should we just pray that we do not receive an email from CMT? Thanks. submitted by /u/fixed-point-learning [link] [comments]  ( 1 min )
    [N] Article series: Deployment of Deep Learning models on Genesis Cloud - Tutorials & Benchmarks
    ​ https://preview.redd.it/4icndo16fsm81.png?width=2036&format=png&auto=webp&s=486a5d676222a279ed45b45ec187028b21dcabcc We are proud to introduce our new article series that will guide you on how to run state of the art deep learning models on Genesis Cloud infrastructure. These articles will be initially published as blog posts and will be added to our knowledge base after their release. Please note: The order of the articles is important as articles are written as a series and information contained in the initial articles might be required for understanding the subsequent articles. In this series of articles we will use 1x RTX 3080 instance type on Genesis Cloud (our recommended GPU for inference use) and showcase four (4) different deployment strategies for deep learning inference usi…  ( 2 min )
    [D] Dataset Normalization using Shrink-and-Perturb Warm Start
    I have trained a large deep learning model on Dataset A. It works well on data similar to dataset A, but doesn't generalize that well as real world problems are more complex and varied than Dataset A. Since training, I have compiled a new dataset, Dataset B, that has significantly more data and a more realistic data distribution. I would like to re-train the model using Dataset B. My understanding is warm-starting the model can hurt generalization, but since the model is large, I would like to avoid starting training from scratch. For this reason, I am thinking of doing a warm-start using "Shrink and Perturb". overview: https://jamesmccaffrey.wordpress.com/2020/12/14/neural-network-warm-start-shrink-and-perturb/ paper: https://arxiv.org/pdf/1910.08475.pdf Dataset A's features have been normalized using the mean and variance of the features calculated on the dataset. To get the best results, should I normalize Dataset B using the same statistics (mean and variance calculated from Dataset A) or normalize using new statistics calculated from Dataset B. My assumption is that the best results will be obtained by training from scratch using Dataset B normalized using statistics from Dataset B, but since I may have to do this process many times, being able to warm-start and use the Shrink and Perturb method is preferred - I'm just not sure whether I should be maintaining the normalization statistics from the initial dataset for all dataset normalization, or re-calculate the statistics for each new Dataset. My assumption is that the same normalization statistics should be used, but will this hurt the generalization of the model in any way? Any thoughts? Thanks! submitted by /u/csmrh [link] [comments]  ( 1 min )
    [D] Does the RTX 3060 work reasonably well for deep learning?
    Yes, it's a low end chip, but the 12GB make it quite attractive. It might not run fast, but it'll be able to run things that won't run on the 8GB cards, so if the 10/12GB cards are out of my budget, it seems like an option worth considering. However, when it came out, people complained about poor cuda support - was that just close to release and has since been fixed? Another issue besides the low end chip is the low bandwidth of the memory. Does anyone have any experience with it or any cuda/DL benchmarks they can share? Thanks! submitted by /u/AuspiciousApple [link] [comments]  ( 3 min )
    [R] Snorkel weak labeling for NER. When a token does not fall under any of the class labels or abstains in NER?
    I have a program that labels a sequence of words using ontologies. I have labeling functions for class negative, class positive, and class abstain. Should I keep the word unlabeled if it does not fall under any of these class labels and ignore it or should I force-label them under either class negative or abstain? I will be grateful for any hints or help. submitted by /u/freaky_eater [link] [comments]  ( 1 min )
    [D] Doing Machine Learning with a vibrating metal plate!
    I recently came across this extremely cool class of AI systems that uses physical transformations in hardware directly to train. A vibrating metal plate trained using this method reached 87% accuracy for the popular MNIST handwritten digit classification task. Training is done using the Physics Aware Training - training data is input to the physical system alongside trainable parameters -> the physical system applies its transformation to produce an output -> the output is compared with the target output to calculate error -> then a differentiable digital model estimates the gradient loss with respect to controllable parameters -> finally, the parameters are updated based on the inferred gradient. By repeating the process multiple times, the error is reduced submitted by /u/ifcarscouldspeak [link] [comments]  ( 1 min )
    [R] Using KRLS-T instead of Gaussian Process for regression for a MPC
    I am working in autonomous racing in a non cooperative scenario; I would use the KRLS-T to infer the model of the vehicle ahead of me and then use an MPC. Do you think it could bring any advantage in comparison to the gaussian process regression? Its complexity would be O(n2) and it could be brought down using random fourier feature and such, whereas the gaussian process complexity is O(n3). If not KRLS-T do you have any other technique to reccomend? submitted by /u/enricocco [link] [comments]  ( 1 min )
    [D] Trouble with training convergence with composite loss function
    I'm training a network with a loss function comprising of two different losses, Loss = Loss1 + alpha*Loss2. What I'm trying to achieve is to let the network learn a representation that allows it to do well on two different tasks. When I train with just either of the loss functions alone, the train error is able to decrease steadily. However, with the combined loss function, the training loss is unable to decrease. Does anyone have experience on how to train with composite loss functions? submitted by /u/jucheonsun [link] [comments]  ( 1 min )
    [D] Apple Claims M1 Ultra Beats RTX 3090. What's the State of Training with M1?
    Apple claims that M1 Ultra can beat the performance of RTX 3090. Has anyone successfully train a model with M1 chip? If so what library? That kind of performance with 128gb memory could possibly be a game changer if there's a natively optimized library for M1? submitted by /u/jl303 [link] [comments]  ( 3 min )
    [R] How to make CUDA libraries more performant?
    Hi, I am trying to understand some things about CUDA (I am a noob when it comes to CUDA). Consider any library provided by Nvidia like cuDNN and consider some function "F" defined in this library. From my understanding, the implementation of "F" is architecture-dependent i.e. it might be implemented differently on Ampere than on previous architectures to get the most performance. Am I correct in saying this? (and is this the reason why I should prefer Nvidia libraries for all the possible cases) If the implementation of "F" is architecture-dependent, then doesn't it create performance issues. As you basically have to use if-else conditions to use the appropriate implementation for the installed GPU? If I download cuDNN, then from my understanding the exact GPU I have does not matter, …  ( 2 min )
    [P] Colab-xterm: Make you possible to monitor gpu/cpu usage in colab free plan
    Do you use colab? Have you ever wished to watch the cpu/gpu usage while training? In colab paid version (pro or pro+), you can use terminal to see nvidia-smi or htop to monitor the system metrics. But it is not available in free plan. colab-xterm is a tool that allows you to open a terminal in a cell. Just copy and paste the following code in a colab cell and run any command you need !pip install colab-xterm %load_ext colabxterm %xterm tmux + htop + vim + nvidia-smi I am the author of the project. If you think it is useful, please give me a start.https://github.com/InfuseAI/colab-xterm submitted by /u/popcornylu [link] [comments]  ( 1 min )
  • Open

    Classification of pig calls produced from birth to slaughter according to their emotional valence and context of production
    submitted by /u/nickb [link] [comments]
    Which are the most influential papers on Artificial Neural Networks that you would recommend for a beginner?
    The title mostly covers it. are there any papers you personally would recommend? Thanks all. submitted by /u/progressiveavocado [link] [comments]  ( 1 min )
    How to learn financial application of neural networking?
    Especially in the context of risk, how would neural networking be used? What are some good resources to read/watch videos on? submitted by /u/goodasseyebrows [link] [comments]
    Simple NN AI trying to land Lunar Lander
    submitted by /u/hotcodist [link] [comments]
  • Open

    Last Week in AI: AI-Developed Drug, AI Beats Moore’s Law, Russia’s AI Army, Baidu’s Robotaxis
    submitted by /u/regalalgorithm [link] [comments]
    So, here is a question...
    I'm noticing an uptick in Artificial voices for Radio Announcing. What's Odd, is that sometimes it's even the usual guy, who's still living. but he's CLEARLY an AI Voice, as he..stutt...ers his..speEECH in Oddd places. ​ So, how do they create these things? Can anyone take a voice and make new content out of it? I'd have fun taking James Avery's Shredder voice, and redubbing episodes that he missed :). submitted by /u/HorribleEmulator [link] [comments]  ( 1 min )
    Researchers From the University of Hamburg Propose A Machine Learning Model, Called ‘LipSound2’, That Directly Predicts Speech Representations From Raw Pixels
    The purpose of the paper presented in this article is to reconstruct speech only based on sequences of images of talking people. The generation of speech from silent videos can be used for many applications: for instance, silent visual input methods used in public environments for privacy protection or understanding speech in surveillance videos. The main challenge in speech reconstruction from visual information is that human speech is produced not only through observable mouth and face movements but also through lips, tongue, and internal organs like vocal cords. Furthermore, it is hard to visually distinguish phonemes like ‘v’ and ‘f’ only through mouth and face movements. This paper leverages the natural co-occurrence of audio and video streams to pre-train a video-to-audio speech reconstruction model through self-supervision. Continue Reading my Summary on this Paper Paper: https://arxiv.org/pdf/2112.04748.pdf https://preview.redd.it/qwimpj0fvsm81.png?width=1079&format=png&auto=webp&s=c69feaa4682098142cdae75e246c228622dede63 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Is AI capable of categorizing web pages?
    Hi everyone. I've been messing with at least image recognition software when it comes to AI, but not so much with things like machine learning. Before I get too deep in the hole, I'd like to understand the current capabilities of AI and find out of it's something that even sounds feasible. There are two main things I'm interested in: Can AI categorize web pages based on content? For example, can it distinguish between a page meant for shopping vs one meant for education. Can AI determine the purpose of a web page? For example, can it determine if a page is meant for ecommerce vs one meant for getting sign ups or leads? Those are the subjects I'm interested in at high level. If they sound like feasible projects, I'd also be curious as to what path and branch of AI I should explore. Thanks in advance. submitted by /u/kenshinx9 [link] [comments]  ( 2 min )
    What are AIs that are good at editing images of faces (it does not matter how the AI is editing the face)?
    submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 1 min )
    Train an AI to recognize all the clothes my brand produces?
    This might be a silly question whether impossible or simple, but I was wondering if it was possible to train an AI to recognize the clothing we offer to avoid having to manually add metadata to everything. I work at a uniform company whose products stay active for a while and retire them infrequently. I don't know much about ai, but it seems like this is the way to go if I want to automatically add metadata to an image. submitted by /u/Brentacula [link] [comments]  ( 1 min )
    GFP-GAN explained: Impressive face restoration model!
    submitted by /u/OnlyProggingForFun [link] [comments]
    Can I buy an agent yet?
    Are there AI agent products that one could purchase and configure without programming ? For example, let's say I am a dentist specializing in regenerative treatments. I'm too busy to read every published paper of relevance to what I do. It would really help to have an agent that scours multiple sources like PubMed and others for every clinical trial, discovery, review, etc. that I might need to read and select a reconfigured number of suggestions. As I rate each of the agent's suggestions it's suggestions improve until eventually it picks what I would if I were doing it Are we there yet? submitted by /u/locabuena [link] [comments]  ( 1 min )
    I'm even more amazed by alpha-go performance beating 3000 years of human accumulated knowledge with those 2 facts
    submitted by /u/grumpyfrench [link] [comments]  ( 2 min )
    Career Advice needed
    submitted by /u/immoral_writer [link] [comments]  ( 1 min )
    AI Generated "Golden Tree"
    submitted by /u/Recent_Coffee_2551 [link] [comments]
  • Open

    Need help implementing RL
    I’m new to reinforcement learning. I am trying to implement this git repo (https://github.com/aryadas98/cs359-networking-project) using RL (q-learning). Any sort of help/references would be greatly appreciated. Thanks submitted by /u/Phoenix17008 [link] [comments]
    Reinforcement learning for continuous action spaces
    Hellooo! Actually I am applying DDPG algorithm in a problem with three action spaces. all of them are defined as: self.action_space = spaces.Box(low=0, high=+1, shape=(3,), dtype=np.float32) All these actions will be used to calculate the global action at time t. However, I am having a problem in the DDPG outputs. At each time step, the agent outputs roughly the same actions which sum to nearly 1 (whatever the state). I´ have already checked that the environment is working (with stable_baselines3.common.env_checker) Also, I have already tested many hyperparameters values but in vain. I am waiting that the agent outputs a diffirent value of the three actions each time we have a different state but always the agent outputs roughly the same actions values such that sum(action1 + action2 + action3) = 1 (nearly) I don´t know what may constrain this in my code? the second point is that the actions start in something like [0.5xxxx, 0,4xxxxx, 0.5xxx] and end in [0.00xxxx, 0.00xxxx, 0.99xxxx] ​ https://preview.redd.it/mno0baf89sm81.png?width=486&format=png&auto=webp&s=9f6a49e05f7bb0b7a5a966c7995ee0c4814dd92e https://preview.redd.it/q3fwpfpz9sm81.png?width=1278&format=png&auto=webp&s=e1eb79139f578dc4f43504ff4d51b5cc8372aa4e ​ https://preview.redd.it/vbats6y2asm81.png?width=865&format=png&auto=webp&s=c82a0b6400e7fe56c587828e7377f95c979ac7c6 ​ https://preview.redd.it/uwun2h76asm81.png?width=1266&format=png&auto=webp&s=0a84a364da5d339d01ab26d68d41bd2197cb24f5 ​ https://preview.redd.it/sa21suceasm81.png?width=1270&format=png&auto=webp&s=f2e34cf61d5351d32ed6a954a0c32ddae2174880 ​ https://preview.redd.it/jnyzakahasm81.png?width=943&format=png&auto=webp&s=716fc53da05f946a6b009402929f2a73e10d8e2f ​ https://preview.redd.it/apizat8wasm81.png?width=1386&format=png&auto=webp&s=77918a6165a48de413dd585f25243bb206478705 submitted by /u/GuavaAgreeable208 [link] [comments]  ( 3 min )
    A3C Algorithm on gym's Bipedal Walker is not converging.
    Hi, I am trying to implement A3C algorithm on OpenAI gym Bipedal Walker. I am newbie to reinforcement learning so I took reference from Pieter Abbeel's video lectures on Intro to Deep RL. Please suggest me if something is wrong with the implementation of A3C or on why this algorithm is not converging. My code is given below: ``` from tensorflow.keras.layers import Dense, Input, Lambda, Flatten from tensorflow.keras.models import Sequential from tensorflow.keras.optimizers import Adam import tensorflow_probability as tfp from collections import deque import os import gym import numpy as np import wandb import random max_episodes = 10_000 t_max = 2_000 initial_learning_rate = 0.0005 lr_change = 0.000001 memory_size = 300 discount = 0.9995 train_freq = 30 feed_memory = 3 env = gym.make("…  ( 3 min )
    Why is Deep Learning used with RL? DL is hard to explain and seems to be unintuitive in the areas
    *EDIT: in the areas which it is used in within RL I mean people use DL for autonomous cars, and I have to wonder if doing that isn't stupid and outright irresponsible. DL is at best explainable, and not even that much. Using it with RL within autonomous vehicles seems like a mistake. I know, it's a continuous space, it's too hard to use something else because of how large those spaces are, but even so. These are cars we're talking about, they make the wrong turn, make the wrong decision, and they could very well kill someone. How exactly is anyone going to explain how that happened, if DL models are really black boxes? I mean, are people just going to look at the activation atlases and figure out which neuron activated? Those things and other ways of using explainability are really weak in my opinion, they're heuristics trying to explain other heuristics. So why is there this obsession to put DL into RL? Can't researchers look for other methods? RL has a foundation on MDPs, it can be proven and verified mathematically, DL can't. Why on Earth would anybody want to mix this all in with RL? I think this is a mistake, and researchers should really look towards something else other than DL. In my opinion RL is really the milestone of machine learning, not those magic boxes that you put data into and give you some sort of unexplainable result. submitted by /u/HesperusIII [link] [comments]  ( 4 min )
  • Open

    AI Image Recognition 2022: A Comprehensive Guide
    Check to understand how image recognition technology works and why image detection revolutionizes business.  ( 8 min )
  • Open

    Computational modeling guides development of new materials
    Chemical engineers use neural networks to discover the properties of metal-organic frameworks, for catalysis and other applications.  ( 6 min )
  • Open

    Reliable Adversarial Distillation with Unreliable Teachers. (arXiv:2106.04928v2 [cs.LG] UPDATED)
    In ordinary distillation, student networks are trained with soft labels (SLs) given by pretrained teacher networks, and students are expected to improve upon teachers since SLs are stronger supervision than the original hard labels. However, when considering adversarial robustness, teachers may become unreliable and adversarial distillation may not work: teachers are pretrained on their own adversarial data, and it is too demanding to require that teachers are also good at every adversarial data queried by students. Therefore, in this paper, we propose reliable introspective adversarial distillation (IAD) where students partially instead of fully trust their teachers. Specifically, IAD distinguishes between three cases given a query of a natural data (ND) and the corresponding adversarial data (AD): (a) if a teacher is good at AD, its SL is fully trusted; (b) if a teacher is good at ND but not AD, its SL is partially trusted and the student also takes its own SL into account; (c) otherwise, the student only relies on its own SL. Experiments demonstrate the effectiveness of IAD for improving upon teachers in terms of adversarial robustness.  ( 2 min )
    Mixture Outlier Exposure: Towards Out-of-Distribution Detection in Fine-grained Environments. (arXiv:2106.03917v3 [cs.LG] UPDATED)
    Many real-world scenarios in which DNN-based recognition systems are deployed have inherently fine-grained attributes (e.g., bird-species recognition, medical image classification). In addition to achieving reliable accuracy, a critical subtask for these models is to detect Out-of-distribution (OOD) inputs. Given the nature of the deployment environment, one may expect such OOD inputs to also be fine-grained w.r.t. the known classes (e.g., a novel bird species), which are thus extremely difficult to identify. Unfortunately, OOD detection in fine-grained scenarios remains largely underexplored. In this work, we aim to fill this gap by first carefully constructing four large-scale fine-grained test environments, in which existing methods are shown to have difficulties. Particularly, we find that even explicitly incorporating a diverse set of auxiliary outlier data during training does not provide sufficient coverage over the broad region where fine-grained OOD samples locate. We then propose Mixture Outlier Exposure (MixOE), which mixes ID data and training outliers to expand the coverage of different OOD granularities, and trains the model such that the prediction confidence linearly decays as the input transitions from ID to OOD. Extensive experiments and analyses demonstrate the effectiveness of MixOE for building up OOD detector in fine-grained environments. Code is available at https://github.com/zjysteven/MixOE.  ( 2 min )
    Fair Decision-Making for Food Inspections. (arXiv:2108.05523v2 [cs.LG] UPDATED)
    Data and algorithms are essential and complementary parts of a large-scale decision-making process. However, their injudicious use can lead to unforeseen consequences, as has been observed by researchers and activists alike in the recent past. In this paper, we revisit the application of predictive models by the Chicago Department of Public Health to schedule restaurant inspections and prioritize the detection of critical food code violations. We perform the first analysis of the model's fairness to the population served by the restaurants in terms of average time to find a critical violation. We find that the model treats inspections unequally based on the sanitarian who conducted the inspection and that, in turn, there are geographic disparities in the benefits of the model. We examine four alternate methods of model training and two alternative ways of scheduling using the model and find that the latter generate more desirable results. The challenges from this application point to important directions for future work around fairness with collective entities rather than individuals, the use of critical violations as a proxy, and the disconnect between fair classification and fairness in the dynamic scheduling system.  ( 2 min )
    Automatic event detection in football using tracking data. (arXiv:2202.00804v2 [cs.LG] UPDATED)
    One of the main shortcomings of event data in football, which has been extensively used for analytics in the recent years, is that it still requires manual collection, thus limiting its availability to a reduced number of tournaments. In this work, we propose a computational, decision tree-based approach to automatically extract football events using tracking data, which consists of two models: (1) a possession model that evaluates which player was in possession of the ball at each frame in the tracking data, as well as the distinct player configurations in the time intervals where the ball is not in play; (2) an event detection model that combines the laws of football with the changes in ball possession computed in the first step to determine in-game events and set pieces. The automatically generated events are benchmarked against manually annotated events and we show that in most event categories the proposed methodology achieves $+90\%$ detection rate across different tournaments and tracking data providers. Finally, we demonstrate how the contextual information offered by tracking data can be leveraged to increase the granularity of auto-detected events, and exhibit how the proposed framework may be used to conduct a myriad of data analyses in football.  ( 2 min )
    Machine Learning Methods in Solving the Boolean Satisfiability Problem. (arXiv:2203.04755v1 [cs.AI])
    This paper reviews the recent literature on solving the Boolean satisfiability problem (SAT), an archetypal NP-complete problem, with the help of machine learning techniques. Despite the great success of modern SAT solvers to solve large industrial instances, the design of handcrafted heuristics is time-consuming and empirical. Under the circumstances, the flexible and expressive machine learning methods provide a proper alternative to solve this long-standing problem. We examine the evolving ML-SAT solvers from naive classifiers with handcrafted features to the emerging end-to-end SAT solvers such as NeuroSAT, as well as recent progress on combinations of existing CDCL and local search solvers with machine learning methods. Overall, solving SAT with machine learning is a promising yet challenging research topic. We conclude the limitations of current works and suggest possible future directions.  ( 2 min )
    Estimate of the Neural Network Dimension using Algebraic Topology and Lie Theory. (arXiv:2004.02881v12 [stat.ML] UPDATED)
    In this paper we present an approach to determine the smallest possible number of neurons in a layer of a neural network in such a way that the topology of the input space can be learned sufficiently well. We introduce a general procedure based on persistent homology to investigate topological invariants of the manifold on which we suspect the data set. We specify the required dimensions precisely, assuming that there is a smooth manifold on or near which the data are located. Furthermore, we require that this space is connected and has a commutative group structure in the mathematical sense. These assumptions allow us to derive a decomposition of the underlying space whose topology is well known. We use the representatives of the $k$-dimensional homology groups from the persistence landscape to determine an integer dimension for this decomposition. This number is the dimension of the embedding that is capable of capturing the topology of the data manifold. We derive the theory and validate it experimentally on toy data sets.  ( 2 min )
    HINT: Hierarchical Interaction Network for Trial Outcome Prediction Leveraging Web Data. (arXiv:2102.04252v2 [cs.CY] UPDATED)
    Clinical trials are crucial for drug development but are time consuming, expensive, and often burdensome on patients. More importantly, clinical trials face uncertain outcomes due to issues with efficacy, safety, or problems with patient recruitment. If we were better at predicting the results of clinical trials, we could avoid having to run trials that will inevitably fail more resources could be devoted to trials that are likely to succeed. In this paper, we propose Hierarchical INteraction Network (HINT) for more general, clinical trial outcome predictions for all diseases based on a comprehensive and diverse set of web data including molecule information of the drugs, target disease information, trial protocol and biomedical knowledge. HINT first encode these multi-modal data into latent embeddings, where an imputation module is designed to handle missing data. Next, these embeddings will be fed into the knowledge embedding module to generate knowledge embeddings that are pretrained using external knowledge on pharmaco-kinetic properties and trial risk from the web. Then the interaction graph module will connect all the embedding via domain knowledge to fully capture various trial components and their complex relations as well as their influences on trial outcomes. Finally, HINT learns a dynamic attentive graph neural network to predict trial outcome. Comprehensive experimental results show that HINT achieves strong predictive performance, obtaining 0.772, 0.607, 0.623, 0.703 on PR-AUC for Phase I, II, III, and indication outcome prediction, respectively. It also consistently outperforms the best baseline method by up to 12.4\% on PR-AUC.  ( 2 min )
    Targeted Attack on Deep RL-based Autonomous Driving with Learned Visual Patterns. (arXiv:2109.07723v2 [cs.LG] UPDATED)
    Recent studies demonstrated the vulnerability of control policies learned through deep reinforcement learning against adversarial attacks, raising concerns about the application of such models to risk-sensitive tasks such as autonomous driving. Threat models for these demonstrations are limited to (1) targeted attacks through real-time manipulation of the agent's observation, and (2) untargeted attacks through manipulation of the physical environment. The former assumes full access to the agent's states/observations at all times, while the latter has no control over attack outcomes. This paper investigates the feasibility of targeted attacks through visually learned patterns placed on physical objects in the environment, a threat model that combines the practicality and effectiveness of the existing ones. Through analysis, we demonstrate that a pre-trained policy can be hijacked within a time window, e.g., performing an unintended self-parking, when an adversarial object is present. To enable the attack, we adopt an assumption that the dynamics of both the environment and the agent can be learned by the attacker. Lastly, we empirically show the effectiveness of the proposed attack on different driving scenarios, perform a location robustness test, and study the tradeoff between the attack strength and its effectiveness. Code is available at https://github.com/ASU-APG/Targeted-Physical-Adversarial-Attacks-on-AD  ( 2 min )
    Investigation of Factorized Optical Flows as Mid-Level Representations. (arXiv:2203.04927v1 [cs.LG])
    In this paper, we introduce a new concept of incorporating factorized flow maps as mid-level representations, for bridging the perception and the control modules in modular learning based robotic frameworks. To investigate the advantages of factorized flow maps and examine their interplay with the other types of mid-level representations, we further develop a configurable framework, along with four different environments that contain both static and dynamic objects, for analyzing the impacts of factorized optical flow maps on the performance of deep reinforcement learning agents. Based on this framework, we report our experimental results on various scenarios, and offer a set of analyses to justify our hypothesis. Finally, we validate flow factorization in real world scenarios.  ( 2 min )
    Homological Time Series Analysis of Sensor Signals from Power Plants. (arXiv:2106.02493v5 [cs.LG] UPDATED)
    In this paper, we use topological data analysis techniques to construct a suitable neural network classifier for the task of learning sensor signals of entire power plants according to their reference designation system. We use representations of persistence diagrams to derive necessary preprocessing steps and visualize the large amounts of data. We derive deep architectures with one-dimensional convolutional layers combined with stacked long short-term memories as residual networks suitable for processing the persistence features. We combine three separate sub-networks, obtaining as input the time series itself and a representation of the persistent homology for the zeroth and first dimension. We give a mathematical derivation for most of the used hyper-parameters. For validation, numerical experiments were performed with sensor data from four power plants of the same construction type.  ( 2 min )
    Memory Replay with Data Compression for Continual Learning. (arXiv:2202.06592v2 [cs.LG] UPDATED)
    Continual learning needs to overcome catastrophic forgetting of the past. Memory replay of representative old training samples has been shown as an effective solution, and achieves the state-of-the-art (SOTA) performance. However, existing work is mainly built on a small memory buffer containing a few original data, which cannot fully characterize the old data distribution. In this work, we propose memory replay with data compression (MRDC) to reduce the storage cost of old training samples and thus increase their amount that can be stored in the memory buffer. Observing that the trade-off between the quality and quantity of compressed data is highly nontrivial for the efficacy of memory replay, we propose a novel method based on determinantal point processes (DPPs) to efficiently determine an appropriate compression quality for currently-arrived training samples. In this way, using a naive data compression algorithm with a properly selected quality can largely boost recent strong baselines by saving more compressed data in a limited storage space. We extensively validate this across several benchmarks of class-incremental learning and in a realistic scenario of object detection for autonomous driving.  ( 2 min )
    Are All Layers Created Equal?. (arXiv:1902.01996v4 [stat.ML] UPDATED)
    Understanding deep neural networks is a major research objective with notable experimental and theoretical attention in recent years. The practical success of excessively large networks underscores the need for better theoretical analyses and justifications. In this paper we focus on layer-wise functional structure and behavior in overparameterized deep models. To do so, we study empirically the layers' robustness to post-training re-initialization and re-randomization of the parameters. We provide experimental results which give evidence for the heterogeneity of layers. Morally, layers of large deep neural networks can be categorized as either "robust" or "critical". Resetting the robust layers to their initial values does not result in adverse decline in performance. In many cases, robust layers hardly change throughout training. In contrast, re-initializing critical layers vastly degrades the performance of the network with test error essentially dropping to random guesses. Our study provides further evidence that mere parameter counting or norm calculations are too coarse in studying generalization of deep models, and "flatness" and robustness analysis of trained models need to be examined while taking into account the respective network architectures.  ( 2 min )
    Multi-task Federated Edge Learning (MtFEEL) in Wireless Networks. (arXiv:2108.02517v3 [cs.IT] UPDATED)
    Federated Learning (FL) has evolved as a promising technique to handle distributed machine learning across edge devices. A single neural network (NN) that optimises a global objective is generally learned in most work in FL, which could be suboptimal for edge devices. Although works finding a NN personalised for edge device specific tasks exist, they lack generalisation and/or convergence guarantees. In this paper, a novel communication efficient FL algorithm for personalised learning in a wireless setting with guarantees is presented. The algorithm relies on finding a ``better`` empirical estimate of losses at each device, using a weighted average of the losses across different devices. It is devised from a Probably Approximately Correct (PAC) bound on the true loss in terms of the proposed empirical loss and is bounded by (i) the Rademacher complexity, (ii) the discrepancy, (iii) and a penalty term. Using a signed gradient feedback to find a personalised NN at each device, it is also proven to converge in a Rayleigh flat fading (in the uplink) channel, at a rate of the order max{1/SNR,1/sqrt(T)} Experimental results show that the proposed algorithm outperforms locally trained devices as well as the conventionally used FedAvg and FedSGD algorithms under practical SNR regimes.  ( 2 min )
    CEU-Net: Ensemble Semantic Segmentation of Hyperspectral Images Using Clustering. (arXiv:2203.04873v1 [cs.CV])
    Most semantic segmentation approaches of Hyperspectral images (HSIs) use and require preprocessing steps in the form of patching to accurately classify diversified land cover in remotely sensed images. These approaches use patching to incorporate the rich neighborhood information in images and exploit the simplicity and segmentability of the most common HSI datasets. In contrast, most landmasses in the world consist of overlapping and diffused classes, making neighborhood information weaker than what is seen in common HSI datasets. To combat this issue and generalize the segmentation models to more complex and diverse HSI datasets, in this work, we propose our novel flagship model: Clustering Ensemble U-Net (CEU-Net). CEU-Net uses the ensemble method to combine spectral information extracted from convolutional neural network (CNN) training on a cluster of landscape pixels. Our CEU-Net model outperforms existing state-of-the-art HSI semantic segmentation methods and gets competitive performance with and without patching when compared to baseline models. We highlight CEU-Net's high performance across Botswana, KSC, and Salinas datasets compared to HybridSN and AeroRIT methods.  ( 2 min )
    Learning from Physical Human Feedback: An Object-Centric One-Shot Adaptation Method. (arXiv:2203.04951v1 [cs.RO])
    For robots to be effectively deployed in novel environments and tasks, they must be able to understand the feedback expressed by humans during intervention. This can either correct undesirable behavior or indicate additional preferences. Existing methods either require repeated episodes of interactions or assume prior known reward features, which is data-inefficient and can hardly transfer to new tasks. We relax these assumptions by describing human tasks in terms of object-centric sub-tasks and interpreting physical interventions in relation to specific objects. Our method, Object Preference Adaptation (OPA), is composed of two key stages: 1) pre-training a base policy to produce a wide variety of behaviors, and 2) online-updating only certain weights in the model according to human feedback. The key to our fast, yet simple adaptation is that general interaction dynamics between agents and objects are fixed, and only object-specific preferences are updated. Our adaptation occurs online, requires only one human intervention (one-shot), and produces new behaviors never seen during training. Trained on cheap synthetic data instead of expensive human demonstrations, our policy demonstrates impressive adaptation to human perturbations on challenging, realistic tasks in our user study. Videos, code, and supplementary material provided.  ( 2 min )
    Robustifying Reinforcement Learning Policies with $\mathcal{L}_1$ Adaptive Control. (arXiv:2106.02249v5 [cs.LG] UPDATED)
    A reinforcement learning (RL) policy trained in a nominal environment could fail in a new/perturbed environment due to the existence of dynamic variations. Existing robust methods try to obtain a fixed policy for all envisioned dynamic variation scenarios through robust or adversarial training. These methods could lead to conservative performance due to emphasis on the worst case, and often involve tedious modifications to the training environment. We propose an approach to robustifying a pre-trained non-robust RL policy with $\mathcal{L}_1$ adaptive control. Leveraging the capability of an $\mathcal{L}_1$ control law in the fast estimation of and active compensation for dynamic variations, our approach can significantly improve the robustness of an RL policy trained in a standard (i.e., non-robust) way, either in a simulator or in the real world. Numerical experiments are provided to validate the efficacy of the proposed approach.  ( 2 min )
    Score-Based Generative Models for Molecule Generation. (arXiv:2203.04698v1 [cs.LG])
    Recent advances in generative models have made exploring design spaces easier for de novo molecule generation. However, popular generative models like GANs and normalizing flows face challenges such as training instabilities due to adversarial training and architectural constraints, respectively. Score-based generative models sidestep these challenges by modelling the gradient of the log probability density using a score function approximation, as opposed to modelling the density function directly, and sampling from it using annealed Langevin Dynamics. We believe that score-based generative models could open up new opportunities in molecule generation due to their architectural flexibility, such as replacing the score function with an SE(3) equivariant model. In this work, we lay the foundations by testing the efficacy of score-based models for molecule generation. We train a Transformer-based score function on Self-Referencing Embedded Strings (SELFIES) representations of 1.5 million samples from the ZINC dataset and use the Moses benchmarking framework to evaluate the generated samples on a suite of metrics.  ( 2 min )
    Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets. (arXiv:2203.04810v1 [cs.LG])
    This technical note describes the recent updates of Graphormer, including architecture design modifications, and the adaption to 3D molecular dynamics simulation. With these simple modifications, Graphormer could attain better results on large-scale molecular modeling datasets than the vanilla one, and the performance gain could be consistently obtained on 2D and 3D molecular graph modeling tasks. In addition, we show that with a global receptive field and an adaptive aggregation strategy, Graphormer is more powerful than classic message-passing-based GNNs. Empirically, Graphormer could achieve much less MAE than the originally reported results on the PCQM4M quantum chemistry dataset used in KDD Cup 2021. In the meanwhile, it greatly outperforms the competitors in the recent Open Catalyst Challenge, which is a competition track on NeurIPS 2021 workshop, and aims to model the catalyst-adsorbate reaction system with advanced AI models. All codes could be found at https://github.com/Microsoft/Graphormer.  ( 2 min )
    Language Model-driven Negative Sampling. (arXiv:2203.04703v1 [cs.AI])
    Knowledge Graph Embeddings (KGEs) encode the entities and relations of a knowledge graph (KG) into a vector space with a purpose of representation learning and reasoning for an ultimate downstream task (i.e., link prediction, question answering). Since KGEs follow closed-world assumption and assume all the present facts in KGs to be positive (correct), they also require negative samples as a counterpart for learning process for truthfulness test of existing triples. Therefore, there are several approaches for creating negative samples from the existing positive ones through a randomized distribution. This choice of generating negative sampling affects the performance of the embedding models as well as their generalization. In this paper, we propose an approach for generating negative sampling considering the existing rich textual knowledge in KGs. %The proposed approach is leveraged to cluster other relevant representations of the entities inside a KG. Particularly, a pre-trained Language Model (LM) is utilized to obtain the contextual representation of symbolic entities. Our approach is then capable of generating more meaningful negative samples in comparison to other state of the art methods. Our comprehensive evaluations demonstrate the effectiveness of the proposed approach across several benchmark datasets for like prediction task. In addition, we show cased our the functionality of our approach on a clustering task where other methods fall short.  ( 2 min )
    Attention-effective multiple instance learning on weakly stem cell colony segmentation. (arXiv:2203.04606v1 [eess.IV])
    The detection of induced pluripotent stem cell (iPSC) colonies often needs the precise extraction of the colony features. However, existing computerized systems relied on segmentation of contours by preprocessing for classifying the colony conditions were task-extensive. To maximize the efficiency in categorizing colony conditions, we propose a multiple instance learning (MIL) in weakly supervised settings. It is designed in a single model to produce weak segmentation and classification of colonies without using finely labeled samples. As a single model, we employ a U-net-like convolution neural network (CNN) to train on binary image-level labels for MIL colonies classification. Furthermore, to specify the object of interest we used a simple post-processing method. The proposed approach is compared over conventional methods using five-fold cross-validation and receiver operating characteristic (ROC) curve. The maximum accuracy of the MIL-net is 95%, which is 15 % higher than the conventional methods. Furthermore, the ability to interpret the location of the iPSC colonies based on the image level label without using a pixel-wise ground truth image is more appealing and cost-effective in colony condition recognition.  ( 2 min )
    Regularized Training of Intermediate Layers for Generative Models for Inverse Problems. (arXiv:2203.04382v1 [cs.LG])
    Generative Adversarial Networks (GANs) have been shown to be powerful and flexible priors when solving inverse problems. One challenge of using them is overcoming representation error, the fundamental limitation of the network in representing any particular signal. Recently, multiple proposed inversion algorithms reduce representation error by optimizing over intermediate layer representations. These methods are typically applied to generative models that were trained agnostic of the downstream inversion algorithm. In our work, we introduce a principle that if a generative model is intended for inversion using an algorithm based on optimization of intermediate layers, it should be trained in a way that regularizes those intermediate layers. We instantiate this principle for two notable recent inversion algorithms: Intermediate Layer Optimization and the Multi-Code GAN prior. For both of these inversion algorithms, we introduce a new regularized GAN training algorithm and demonstrate that the learned generative model results in lower reconstruction errors across a wide range of under sampling ratios when solving compressed sensing, inpainting, and super-resolution problems.  ( 2 min )
    Multi-Agent Policy Transfer via Task Relationship Modeling. (arXiv:2203.04482v1 [cs.AI])
    Team adaptation to new cooperative tasks is a hallmark of human intelligence, which has yet to be fully realized in learning agents. Previous work on multi-agent transfer learning accommodate teams of different sizes, heavily relying on the generalization ability of neural networks for adapting to unseen tasks. We believe that the relationship among tasks provides the key information for policy adaptation. In this paper, we try to discover and exploit common structures among tasks for more efficient transfer, and propose to learn effect-based task representations as a common space of tasks, using an alternatively fixed training scheme. We demonstrate that the task representation can capture the relationship among tasks, and can generalize to unseen tasks. As a result, the proposed method can help transfer learned cooperation knowledge to new tasks after training on a few source tasks. We also find that fine-tuning the transferred policies help solve tasks that are hard to learn from scratch.  ( 2 min )
    All You Need is LUV: Unsupervised Collection of Labeled Images using Invisible UV Fluorescent Indicators. (arXiv:2203.04566v1 [cs.CV])
    Large-scale semantic image annotation is a significant challenge for learning-based perception systems in robotics. Current approaches often rely on human labelers, which can be expensive, or simulation data, which can visually or physically differ from real data. This paper proposes Labels from UltraViolet (LUV), a novel framework that enables rapid, labeled data collection in real manipulation environments without human labeling. LUV uses transparent, ultraviolet-fluorescent paint with programmable ultraviolet LEDs to collect paired images of a scene in standard lighting and UV lighting to autonomously extract segmentation masks and keypoints via color segmentation. We apply LUV to a suite of diverse robot perception tasks to evaluate its labeling quality, flexibility, and data collection rate. Results suggest that LUV is 180-2500 times faster than a human labeler across the tasks. We show that LUV provides labels consistent with human annotations on unpainted test images. The networks trained on these labels are used to smooth and fold crumpled towels with 83% success rate and achieve 1.7mm position error with respect to human labels on a surgical needle pose estimation task. The low cost of LUV makes it ideal as a lightweight replacement for human labeling systems, with the one-time setup costs at $300 equivalent to the cost of collecting around 200 semantic segmentation labels on Amazon Mechanical Turk. Code, datasets, visualizations, and supplementary material can be found at https://sites.google.com/berkeley.edu/luv  ( 2 min )
    Explainable Machine Learning for Predicting Homicide Clearance in the United States. (arXiv:2203.04768v1 [cs.LG])
    Purpose: To explore the potential of Explainable Machine Learning in the prediction and detection of drivers of cleared homicides at the national- and state-levels in the United States. Methods: First, nine algorithmic approaches are compared to assess the best performance in predicting cleared homicides country-wise, using data from the Murder Accountability Project. The most accurate algorithm among all (XGBoost) is then used for predicting clearance outcomes state-wise. Second, SHAP, a framework for Explainable Artificial Intelligence, is employed to capture the most important features in explaining clearance patterns both at the national and state levels. Results: At the national level, XGBoost demonstrates to achieve the best performance overall. Substantial predictive variability is detected state-wise. In terms of explainability, SHAP highlights the relevance of several features in consistently predicting investigation outcomes. These include homicide circumstances, weapons, victims' sex and race, as well as number of involved offenders and victims. Conclusions: Explainable Machine Learning demonstrates to be a helpful framework for predicting homicide clearance. SHAP outcomes suggest a more organic integration of the two theoretical perspectives emerged in the literature. Furthermore, jurisdictional heterogeneity highlights the importance of developing ad hoc state-level strategies to improve police performance in clearing homicides.  ( 2 min )
    Rank-Regret Minimization. (arXiv:2111.08563v2 [cs.LG] UPDATED)
    Multi-criteria decision-making often requires finding a small representative set from the database. A recently proposed method is the regret minimization set (RMS) query. RMS returns a size $r$ subset $S$ of dataset $D$ that minimizes the regret-ratio (the difference between the score of top-1 in $S$ and the score of top-1 in $D$, for any possible utility function). RMS is not shift invariant, causing inconsistency in results. Further, existing work showed that the regret-ratio is often a made-up number and users may mistake its absolute value. Instead, users do understand the notion of rank. Thus it considered the problem of finding the minimal set $S$ with a rank-regret (the rank of top-1 tuple of $S$ in the sorted list of $D$) at most $k$, called the rank-regret representative (RRR) problem. Corresponding to RMS, we focus on the min-error version of RRR, called the rank-regret minimization (RRM) problem, which finds a size $r$ set to minimize the maximum rank-regret for all utility functions. Further, we generalize RRM and propose the restricted RRM (i.e., RRRM) problem to optimize the rank-regret for functions restricted in a given space. Previous studies on both RMS and RRR did not consider the restricted function space. The solution for RRRM usually has a lower regret level and can better serve the specific preferences of some users. Note that RRM and RRRM are shift invariant. In 2D space, we design a dynamic programming algorithm 2DRRM to return the optimal solution for RRM. In HD space, we propose an algorithm HDRRM that introduces a double approximation guarantee on rank-regret. Both 2DRRM and HDRRM are applicable for RRRM. Extensive experiments on the synthetic and real datasets verify the efficiency and effectiveness of our algorithms. In particular, HDRRM always has the best output quality in experiments.  ( 2 min )
    TLogic: Temporal Logical Rules for Explainable Link Forecasting on Temporal Knowledge Graphs. (arXiv:2112.08025v2 [cs.LG] UPDATED)
    Conventional static knowledge graphs model entities in relational data as nodes, connected by edges of specific relation types. However, information and knowledge evolve continuously, and temporal dynamics emerge, which are expected to influence future situations. In temporal knowledge graphs, time information is integrated into the graph by equipping each edge with a timestamp or a time range. Embedding-based methods have been introduced for link prediction on temporal knowledge graphs, but they mostly lack explainability and comprehensible reasoning chains. Particularly, they are usually not designed to deal with link forecasting -- event prediction involving future timestamps. We address the task of link forecasting on temporal knowledge graphs and introduce TLogic, an explainable framework that is based on temporal logical rules extracted via temporal random walks. We compare TLogic with state-of-the-art baselines on three benchmark datasets and show better overall performance while our method also provides explanations that preserve time consistency. Furthermore, in contrast to most state-of-the-art embedding-based methods, TLogic works well in the inductive setting where already learned rules are transferred to related datasets with a common vocabulary.  ( 2 min )
    Align-Deform-Subtract: An Interventional Framework for Explaining Object Differences. (arXiv:2203.04694v1 [cs.CV])
    Given two object images, how can we explain their differences in terms of the underlying object properties? To address this question, we propose Align-Deform-Subtract (ADS) -- an interventional framework for explaining object differences. By leveraging semantic alignments in image-space as counterfactual interventions on the underlying object properties, ADS iteratively quantifies and removes differences in object properties. The result is a set of "disentangled" error measures which explain object differences in terms of their underlying properties. Experiments on real and synthetic data illustrate the efficacy of the framework.  ( 2 min )
    Deep Generative Models for Downlink Channel Estimation in FDD Massive MIMO Systems. (arXiv:2203.04935v1 [cs.IT])
    It is well accepted that acquiring downlink channel state information in frequency division duplexing (FDD) massive multiple-input multiple-output (MIMO) systems is challenging because of the large overhead in training and feedback. In this paper, we propose a deep generative model (DGM)-based technique to address this challenge. Exploiting the partial reciprocity of uplink and downlink channels, we first estimate the frequency-independent underlying channel parameters, i.e., the magnitudes of path gains, delays, angles-of-arrivals (AoAs) and angles-of-departures (AoDs), via uplink training, since these parameters are common in both uplink and downlink. Then, the frequency-specific underlying channel parameters, namely, the phase of each propagation path, are estimated via downlink training using a very short training signal. In the first step, we incorporate the underlying distribution of the channel parameters as a prior into our channel estimation algorithm. We use DGMs to learn this distribution. Simulation results indicate that our proposed DGM-based channel estimation technique outperforms, by a large gap, the conventional channel estimation techniques in practical ranges of signal-to-noise ratio (SNR). In addition, a near-optimal performance is achieved using only few downlink pilot measurements.  ( 2 min )
    Boilerplate Detection via Semantic Classification of TextBlocks. (arXiv:2203.04467v1 [cs.CL])
    We present a hierarchical neural network model called SemText to detect HTML boilerplate based on a novel semantic representation of HTML tags, class names, and text blocks. We train SemText on three published datasets of news webpages and fine-tune it using a small number of development data in CleanEval and GoogleTrends-2017. We show that SemText achieves the state-of-the-art accuracy on these datasets. We then demonstrate the robustness of SemText by showing that it also detects boilerplate effectively on out-of-domain community-based question-answer webpages.  ( 2 min )
    Structure and Distribution Metric for Quantifying the Quality of Uncertainty: Assessing Gaussian Processes, Deep Neural Nets, and Deep Neural Operators for Regression. (arXiv:2203.04515v1 [cs.LG])
    We propose two bounded comparison metrics that may be implemented to arbitrary dimensions in regression tasks. One quantifies the structure of uncertainty and the other quantifies the distribution of uncertainty. The structure metric assesses the similarity in shape and location of uncertainty with the true error, while the distribution metric quantifies the supported magnitudes between the two. We apply these metrics to Gaussian Processes (GPs), Ensemble Deep Neural Nets (DNNs), and Ensemble Deep Neural Operators (DNOs) on high-dimensional and nonlinear test cases. We find that comparing a model's uncertainty estimates with the model's squared error provides a compelling ground truth assessment. We also observe that both DNNs and DNOs, especially when compared to GPs, provide encouraging metric values in high dimensions with either sparse or plentiful data.  ( 2 min )
    HAIDA: Biometric technological therapy tools for neurorehabilitation of Cognitive Impairment. (arXiv:2203.04645v1 [cs.LG])
    Dementia, and specially Alzheimer s disease (AD) and Mild Cognitive Impairment (MCI) are one of the most important diseases suffered by elderly population. Music therapy is one of the most widely used non-pharmacological treatment in the field of cognitive impairments, given that music influences their mood, behavior, the decrease of anxiety, as well as facilitating reminiscence, emotional expressions and movement. In this work we present HAIDA, a multi-platform support system for Musical Therapy oriented to cognitive impairment, which includes not only therapy tools but also non-invasive biometric analysis, speech, activity and hand activity. At this moment the system is on use and recording the first sets of data.  ( 2 min )
    Structural & Granger CAUSALITY for IoT Digital Twin. (arXiv:2203.04876v1 [eess.SP])
    In this foundational expository article on the application of Causality Analysis in IoT, we establish the basic theory and algorithms for estimating Structural and Granger causality factors from measured multichannel sensor data (vector timeseries). Vector timeseries is modeled as a Structural Vector Autoregressive (SVAR) model; utilizing Kalman Filter and Independent Component Analysis (ICA) methods, Structural and generalized Granger causality factors are estimated. The estimated causal factors are presented as a Fence graph which we call Causal Digital Twin. Practical applications of Causal Digital Twin are demonstrated on NASA Prognostic Data Repository Bearing data collection. Use of Causal Digital Twin for counterfactual experiments are indicated. Causal Digital Twin is a horizontal solution that applies to diverse use cases in multiple industries such as Industrial, Manufacturing, Automotive, Consumer, Building and Smart City.  ( 2 min )
    A Neuro-vector-symbolic Architecture for Solving Raven's Progressive Matrices. (arXiv:2203.04571v1 [cs.LG])
    Neither deep neural networks nor symbolic AI alone have approached the kind of intelligence expressed in humans. This is mainly because neural networks are not able to decompose distinct objects from their joint representation (the so-called binding problem), while symbolic AI suffers from exhaustive rule searches, among other problems. These two problems are still pronounced in neuro-symbolic AI which aims to combine the best of the two paradigms. Here, we show that the two problems can be addressed with our proposed neuro-vector-symbolic architecture (NVSA) by exploiting its powerful operators on fixed-width holographic vectorized representations that serve as a common language between neural networks and symbolic logical reasoning. The efficacy of NVSA is demonstrated by solving the Raven's progressive matrices. NVSA achieves a new record of 97.7% average accuracy in RAVEN, and 98.8% in I-RAVEN datasets, with two orders of magnitude faster execution than the symbolic logical reasoning on CPUs.  ( 2 min )
    Persistent Homology as Stopping-Criterion for Voronoi Interpolation. (arXiv:1911.02922v15 [cs.CG] UPDATED)
    In this study the Voronoi interpolation is used to interpolate a set of points drawn from a topological space with higher homology groups on its filtration. The technique is based on Voronoi tessellation, which induces a natural dual map to the Delaunay triangulation. Advantage is taken from this fact calculating the persistent homology on it after each iteration to capture the changing topology of the data. The boundary points are identified as critical. The Bottleneck and Wasserstein distance serve as a measure of quality between the original point set and the interpolation. If the norm of two distances exceeds a heuristically determined threshold, the algorithm terminates. We give the theoretical basis for this approach and justify its validity with numerical experiments.  ( 2 min )
    SkinningNet: Two-Stream Graph Convolutional Neural Network for Skinning Prediction of Synthetic Characters. (arXiv:2203.04746v1 [cs.CV])
    This work presents SkinningNet, an end-to-end Two-Stream Graph Neural Network architecture that computes skinning weights from an input mesh and its associated skeleton, without making any assumptions on shape class and structure of the provided mesh. Whereas previous methods pre-compute handcrafted features that relate the mesh and the skeleton or assume a fixed topology of the skeleton, the proposed method extracts this information in an end-to-end learnable fashion by jointly learning the best relationship between mesh vertices and skeleton joints. The proposed method exploits the benefits of the novel Multi-Aggregator Graph Convolution that combines the results of different aggregators during the summarizing step of the Message-Passing scheme, helping the operation to generalize for unseen topologies. Experimental results demonstrate the effectiveness of the contributions of our novel architecture, with SkinningNet outperforming current state-of-the-art alternatives.  ( 2 min )
    The Cross-evaluation of Machine Learning-based Network Intrusion Detection Systems. (arXiv:2203.04686v1 [cs.CR])
    Enhancing Network Intrusion Detection Systems (NIDS) with supervised Machine Learning (ML) is tough. ML-NIDS must be trained and evaluated, operations requiring data where benign and malicious samples are clearly labelled. Such labels demand costly expert knowledge, resulting in a lack of real deployments, as well as on papers always relying on the same outdated data. The situation improved recently, as some efforts disclosed their labelled datasets. However, most past works used such datasets just as a 'yet another' testbed, overlooking the added potential provided by such availability. In contrast, we promote using such existing labelled data to cross-evaluate ML-NIDS. Such approach received only limited attention and, due to its complexity, requires a dedicated treatment. We hence propose the first cross-evaluation model. Our model highlights the broader range of realistic use-cases that can be assessed via cross-evaluations, allowing the discovery of still unknown qualities of state-of-the-art ML-NIDS. For instance, their detection surface can be extended--at no additional labelling cost. However, conducting such cross-evaluations is challenging. Hence, we propose the first framework, XeNIDS, for reliable cross-evaluations based on Network Flows. By using XeNIDS on six well-known datasets, we demonstrate the concealed potential, but also the risks, of cross-evaluations of ML-NIDS.  ( 2 min )
    SparseChem: Fast and accurate machine learning model for small molecules. (arXiv:2203.04676v1 [stat.ML])
    SparseChem provides fast and accurate machine learning models for biochemical applications. Especially, the package supports very high-dimensional sparse inputs, e.g., millions of features and millions of compounds. It is possible to train classification, regression and censored regression models, or combination of them from command line. Additionally, the library can be accessed directly from Python. Source code and documentation is freely available under MIT License on GitHub.  ( 2 min )
    Correlated quantization for distributed mean estimation and optimization. (arXiv:2203.04925v1 [cs.LG])
    We study the problem of distributed mean estimation and optimization under communication constraints. We propose a correlated quantization protocol whose error guarantee depends on the deviation of data points instead of their absolute range. The design doesn't need any prior knowledge on the concentration property of the dataset, which is required to get such dependence in previous works. We show that applying the proposed protocol as sub-routine in distributed optimization algorithms leads to better convergence rates. We also prove the optimality of our protocol under mild assumptions. Experimental results show that our proposed algorithm outperforms existing mean estimation protocols on a diverse set of tasks.
    Federated Minimax Optimization: Improved Convergence Analyses and Algorithms. (arXiv:2203.04850v1 [math.OC])
    In this paper, we consider nonconvex minimax optimization, which is gaining prominence in many modern machine learning applications such as GANs. Large-scale edge-based collection of training data in these applications calls for communication-efficient distributed optimization algorithms, such as those used in federated learning, to process the data. In this paper, we analyze Local stochastic gradient descent ascent (SGDA), the local-update version of the SGDA algorithm. SGDA is the core algorithm used in minimax optimization, but it is not well-understood in a distributed setting. We prove that Local SGDA has \textit{order-optimal} sample complexity for several classes of nonconvex-concave and nonconvex-nonconcave minimax problems, and also enjoys \textit{linear speedup} with respect to the number of clients. We provide a novel and tighter analysis, which improves the convergence and communication guarantees in the existing literature. For nonconvex-PL and nonconvex-one-point-concave functions, we improve the existing complexity results for centralized minimax problems. Furthermore, we propose a momentum-based local-update algorithm, which has the same convergence guarantees, but outperforms Local SGDA as demonstrated in our experiments.
    Improving Compound Activity Classification via Deep Transfer and Representation Learning. (arXiv:2111.07439v2 [cs.LG] UPDATED)
    Recent advances in molecular machine learning, especially deep neural networks such as Graph Neural Networks (GNNs) for predicting structure activity relationships (SAR) have shown tremendous potential in computer-aided drug discovery. However, the applicability of such deep neural networks are limited by the requirement of large amounts of training data. In order to cope with limited training data for a target task, transfer learning for SAR modeling has been recently adopted to leverage information from data of related tasks. In this work, in contrast to the popular parameter-based transfer learning such as pretraining, we develop novel deep transfer learning methods TAc and TAc-fc to leverage source domain data and transfer useful information to the target domain. TAc learns to generate effective molecular features that can generalize well from one domain to another, and increase the classification performance in the target domain. Additionally, TAc-fc extends TAc by incorporating novel components to selectively learn feature-wise and compound-wise transferability. We used the bioassay screening data from PubChem, and identified 120 pairs of bioassays such that the active compounds in each pair are more similar to each other compared to its inactive compounds. Our experiments clearly demonstrate that TAc achieves significant improvement over all baselines across a large number of target tasks. Furthermore, although TAc-fc achieves slightly worse ROC-AUC on average compared to TAc, TAc-fc still achieves the best performance on more tasks in terms of PR-AUC and F1 compared to other methods. In summary, TAc-fc is also found to be a strong model with competitive or even better performance than TAc on a notable number of target tasks.
    Wasserstein-based fairness interpretability framework for machine learning models. (arXiv:2011.03156v5 [cs.LG] UPDATED)
    The objective of this article is to introduce a fairness interpretability framework for measuring and explaining the bias in classification and regression models at the level of a distribution. In our work, we measure the model bias across sub-population distributions in the model output using the Wasserstein metric. To properly quantify the contributions of predictors, we take into account the favorability of both the model and predictors with respect to the non-protected class. The quantification is accomplished by the use of transport theory, which gives rise to the decomposition of the model bias and bias explanations to positive and negative contributions. To gain more insight into the role of favorability and allow for additivity of bias explanations, we adapt techniques from cooperative game theory.
    Regularized Deep Signed Distance Fields for Reactive Motion Generation. (arXiv:2203.04739v1 [cs.RO])
    Autonomous robots should operate in real-world dynamic environments and collaborate with humans in tight spaces. A key component for allowing robots to leave structured lab and manufacturing settings is their ability to evaluate online and real-time collisions with the world around them. Distance-based constraints are fundamental for enabling robots to plan their actions and act safely, protecting both humans and their hardware. However, different applications require different distance resolutions, leading to various heuristic approaches for measuring distance fields w.r.t. obstacles, which are computationally expensive and hinder their application in dynamic obstacle avoidance use-cases. We propose Regularized Deep Signed Distance Fields (ReDSDF), a single neural implicit function that can compute smooth distance fields at any scale, with fine-grained resolution over high-dimensional manifolds and articulated bodies like humans, thanks to our effective data generation and a simple inductive bias during training. We demonstrate the effectiveness of our approach in representative simulated tasks for whole-body control (WBC) and safe Human-Robot Interaction (HRI) in shared workspaces. Finally, we provide proof of concept of a real-world application in a HRI handover task with a mobile manipulator robot.
    EvoLearner: Learning Description Logics with Evolutionary Algorithms. (arXiv:2111.04879v2 [cs.AI] UPDATED)
    Classifying nodes in knowledge graphs is an important task, e.g., for predicting missing types of entities, predicting which molecules cause cancer, or predicting which drugs are promising treatment candidates. While black-box models often achieve high predictive performance, they are only post-hoc and locally explainable and do not allow the learned model to be easily enriched with domain knowledge. Towards this end, learning description logic concepts from positive and negative examples has been proposed. However, learning such concepts often takes a long time and state-of-the-art approaches provide limited support for literal data values, although they are crucial for many applications. In this paper, we propose EvoLearner - an evolutionary approach to learn concepts in ALCQ(D), which is the attributive language with complement (ALC) paired with qualified cardinality restrictions (Q) and data properties (D). We contribute a novel initialization method for the initial population: starting from positive examples, we perform biased random walks and translate them to description logic concepts. Moreover, we improve support for data properties by maximizing information gain when deciding where to split the data. We show that our approach significantly outperforms the state of the art on the benchmarking framework SML-Bench for structured machine learning. Our ablation study confirms that this is due to our novel initialization method and support for data properties.
    ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling. (arXiv:2203.04510v1 [cs.LG])
    This paper studies the problem of data collection for policy evaluation in Markov decision processes (MDPs). In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain in an environment formalized as an MDP. We develop theory for optimal data collection within the class of tree-structured MDPs by first deriving an oracle data collection strategy that uses knowledge of the variance of the reward distributions. We then introduce the Reduced Variance Sampling (ReVar) algorithm that approximates the oracle strategy when the reward variances are unknown a priori and bound its sub-optimality compared to the oracle strategy. Finally, we empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.
    FragmGAN: Generative Adversarial Nets for Fragmentary Data Imputation and Prediction. (arXiv:2203.04692v1 [cs.LG])
    Modern scientific research and applications very often encounter "fragmentary data" which brings big challenges to imputation and prediction. By leveraging the structure of response patterns, we propose a unified and flexible framework based on Generative Adversarial Nets (GAN) to deal with fragmentary data imputation and label prediction at the same time. Unlike most of the other generative model based imputation methods that either have no theoretical guarantee or only consider Missing Completed At Random (MCAR), the proposed FragmGAN has theoretical guarantees for imputation with data Missing At Random (MAR) while no hint mechanism is needed. FragmGAN trains a predictor with the generator and discriminator simultaneously. This linkage mechanism shows significant advantages for predictive performances in extensive experiments.
    Speaker Identification Experiments Under Gender De-Identification. (arXiv:2203.04638v1 [cs.SD])
    The present work is based on the COST Action IC1206 for De-identification in multimedia content. It was performed to test four algorithms of voice modifications on a speech gender recognizer to find the degree of modification of pitch when the speech recognizer have the probability of success equal to the probability of failure. The purpose of this analysis is to assess the intensity of the speech tone modification, the quality, the reversibility and not-reversibility of the changes made. Keywords DeIdentification; Speech Algorithms
    Dimensionality Reduction and Prioritized Exploration for Policy Search. (arXiv:2203.04791v1 [cs.LG])
    Black-box policy optimization is a class of reinforcement learning algorithms that explores and updates the policies at the parameter level. This class of algorithms is widely applied in robotics with movement primitives or non-differentiable policies. Furthermore, these approaches are particularly relevant where exploration at the action level could cause actuator damage or other safety issues. However, Black-box optimization does not scale well with the increasing dimensionality of the policy, leading to high demand for samples, which are expensive to obtain in real-world systems. In many practical applications, policy parameters do not contribute equally to the return. Identifying the most relevant parameters allows to narrow down the exploration and speed up the learning. Furthermore, updating only the effective parameters requires fewer samples, improving the scalability of the method. We present a novel method to prioritize the exploration of effective parameters and cope with full covariance matrix updates. Our algorithm learns faster than recent approaches and requires fewer samples to achieve state-of-the-art results. To select the effective parameters, we consider both the Pearson correlation coefficient and the Mutual Information. We showcase the capabilities of our approach on the Relative Entropy Policy Search algorithm in several simulated environments, including robotics simulations. Code is available at https://git.ias.informatik.tu-darmstadt.de/ias\_code/aistats2022/dr-creps}{git.ias.informatik.tu-darmstadt.de/ias\_code/aistats2022/dr-creps.
    Multi-Objective reward generalization: Improving performance of Deep Reinforcement Learning for selected applications in stock and cryptocurrency trading. (arXiv:2203.04579v1 [cs.LG])
    We investigate the potential of Multi-Objective, Deep Reinforcement Learning for stock and cryptocurrency trading. More specifically, we build on the generalized setting \`a la Fontaine and Friedman arXiv:1809.06364 (where the reward weighting mechanism is not specified a priori, but embedded in the learning process) by complementing it with computational speed-ups, and adding the cumulative reward's discount factor to the learning process. Firstly, we verify that the resulting Multi-Objective algorithm generalizes well, and we provide preliminary statistical evidence showing that its prediction is more stable than the corresponding Single-Objective strategy's. Secondly, we show that the Multi-Objective algorithm has a clear edge over the corresponding Single-Objective strategy when the reward mechanism is sparse (i.e., when non-null feedback is infrequent over time). Finally, we discuss the generalization properties of the discount factor. The entirety of our code is provided in open source format.
    Using Statistical Models to Detect Occupancy in Buildings through Monitoring VOC, CO$_2$, and other Environmental Factors. (arXiv:2203.04750v1 [cs.LG])
    Dynamic models of occupancy patterns have shown to be effective in optimizing building-systems operations. Previous research has relied on CO$_2$ sensors and vision-based techniques to determine occupancy patterns. Vision-based techniques provide highly accurate information; however, they are very intrusive. Therefore, motion or CO$_2$ sensors are more widely adopted worldwide. Volatile Organic Compounds (VOCs) are another pollutant originating from the occupants. However, a limited number of studies have evaluated the impact of occupants on the VOC level. In this paper, continuous measurements of CO$_2$, VOC, light, temperature, and humidity were recorded in a 17,000 sqft open office space for around four months. Using different statistical models (e.g., SVM, K-Nearest Neighbors, and Random Forest) we evaluated which combination of environmental factors provides more accurate insights on occupant presence. Our preliminary results indicate that VOC is a good indicator of occupancy detection in some cases. It is also concluded that proper feature selection and developing appropriate global occupancy detection models can reduce the cost and energy of data collection without a significant impact on accuracy.
    Reproducible Subjective Evaluation. (arXiv:2203.04444v1 [cs.HC])
    Human perceptual studies are the gold standard for the evaluation of many research tasks in machine learning, linguistics, and psychology. However, these studies require significant time and cost to perform. As a result, many researchers use objective measures that can correlate poorly with human evaluation. When subjective evaluations are performed, they are often not reported with sufficient detail to ensure reproducibility. We propose Reproducible Subjective Evaluation (ReSEval), an open-source framework for quickly deploying crowdsourced subjective evaluations directly from Python. ReSEval lets researchers launch A/B, ABX, Mean Opinion Score (MOS) and MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) tests on audio, image, text, or video data from a command-line interface or using one line of Python, making it as easy to run as objective evaluation. With ReSEval, researchers can reproduce each other's subjective evaluations by sharing a configuration file and the audio, image, text, or video files.
    Harmonicity Plays a Critical Role in DNN Based Versus in Biologically-Inspired Monaural Speech Segregation Systems. (arXiv:2203.04420v1 [eess.AS])
    Recent advancements in deep learning have led to drastic improvements in speech segregation models. Despite their success and growing applicability, few efforts have been made to analyze the underlying principles that these networks learn to perform segregation. Here we analyze the role of harmonicity on two state-of-the-art Deep Neural Networks (DNN)-based models- Conv-TasNet and DPT-Net. We evaluate their performance with mixtures of natural speech versus slightly manipulated inharmonic speech, where harmonics are slightly frequency jittered. We find that performance deteriorates significantly if one source is even slightly harmonically jittered, e.g., an imperceptible 3% harmonic jitter degrades performance of Conv-TasNet from 15.4 dB to 0.70 dB. Training the model on inharmonic speech does not remedy this sensitivity, instead resulting in worse performance on natural speech mixtures, making inharmonicity a powerful adversarial factor in DNN models. Furthermore, additional analyses reveal that DNN algorithms deviate markedly from biologically inspired algorithms that rely primarily on timing cues and not harmonicity to segregate speech.
    Invariant Representation Driven Neural Classifier for Anti-QCD Jet Tagging. (arXiv:2201.07199v2 [hep-ph] UPDATED)
    We leverage representation learning and the inductive bias in neural-net-based Standard Model jet classification tasks, to detect non-QCD signal jets. In establishing the framework for classification-based anomaly detection in jet physics, we demonstrate that with a \emph{well-calibrated} and \emph{powerful enough feature extractor}, a well-trained \emph{mass-decorrelated} supervised Standard Model neural jet classifier can serve as a strong generic anti-QCD jet tagger for effectively reducing the QCD background. Imposing \emph{data-augmented} mass-invariance (decoupling the dominant factor) not only facilitates background estimation, but also induces more substructure-aware representation learning. We are able to reach excellent tagging efficiencies for all the test signals considered. In the best case, we reach a background rejection rate of 51 and a significance improvement factor of 3.6 at 50 \% signal acceptance, with jet mass decorrelated. This study indicates that supervised Standard Model jet classifiers have great potential in general new physics searches.
    Addressing Bias in Visualization Recommenders by Identifying Trends in Training Data: Improving VizML Through a Statistical Analysis of the Plotly Community Feed. (arXiv:2203.04937v1 [cs.IR])
    Machine learning is a promising approach to visualization recommendation due to its high scalability and representational power. Researchers can create a neural network to predict visualizations from input data by training it over a corpus of datasets and visualization examples. However, these machine learning models can reflect trends in their training data that may negatively affect their performance. Our research project aims to address training bias in machine learning visualization recommendation systems by identifying trends in the training data through statistical analysis.
    Cyclic Coordinate Dual Averaging with Extrapolation for Generalized Variational Inequalities. (arXiv:2102.13244v3 [math.OC] UPDATED)
    Cyclic block coordinate methods are a fundamental class of optimization methods widely used in practice and implemented as part of standard software packages for statistical learning. Nevertheless, their convergence is generally not well understood and so far their good practical performance has not been explained by existing convergence analyses. In this work, we introduce a new block coordinate method that applies to the general class of variational inequality (VI) problems with monotone operators. This class includes composite convex optimization problems and convex-concave min-max optimization problems as special cases and has not been addressed by the existing work. The resulting convergence bounds match the optimal convergence bounds of full gradient methods, but are provided in terms of a novel gradient Lipschitz condition w.r.t.~a Mahalanobis norm. For $m$ coordinate blocks, the resulting gradient Lipschitz constant in our bounds is never larger than a factor $\sqrt{m}$ compared to the traditional Euclidean Lipschitz constant, while it is possible for it to be much smaller. Further, for the case when the operator in the VI has finite-sum structure, we propose a variance reduced variant of our method which further decreases the per-iteration cost and has better convergence rates in certain regimes. To obtain these results, we use a gradient extrapolation strategy that allows us to view a cyclic collection of block coordinate-wise gradients as one implicit gradient.
    Accelerated Primal-Dual Gradient Method for Smooth and Convex-Concave Saddle-Point Problems with Bilinear Coupling. (arXiv:2112.15199v2 [math.OC] UPDATED)
    In this paper we study the convex-concave saddle-point problem $\min_x \max_y f(x) + y^T \mathbf{A} x - g(y)$, where $f(x)$ and $g(y)$ are smooth and convex functions. We propose an Accelerated Primal-Dual Gradient Method (APDG) for solving this problem, achieving (i) an optimal linear convergence rate in the strongly-convex-strongly-concave regime, matching the lower complexity bound (Zhang et al., 2021), and (ii) an accelerated linear convergence rate in the case when only one of the functions $f(x)$ and $g(y)$ is strongly convex or even none of them are. Finally, we obtain a linearly convergent algorithm for the general smooth and convex-concave saddle point problem $\min_x \max_y F(x,y)$ without the requirement of strong convexity or strong concavity.
    Machine Learning in NextG Networks via Generative Adversarial Networks. (arXiv:2203.04453v1 [cs.LG])
    Generative Adversarial Networks (GANs) are Machine Learning (ML) algorithms that have the ability to address competitive resource allocation problems together with detection and mitigation of anomalous behavior. In this paper, we investigate their use in next-generation (NextG) communications within the context of cognitive networks to address i) spectrum sharing, ii) detecting anomalies, and iii) mitigating security attacks. GANs have the following advantages. First, they can learn and synthesize field data, which can be costly, time consuming, and nonrepeatable. Second, they enable pre-training classifiers by using semi-supervised data. Third, they facilitate increased resolution. Fourth, they enable the recovery of corrupted bits in the spectrum. The paper provides the basics of GANs, a comparative discussion on different kinds of GANs, performance measures for GANs in computer vision and image processing as well as wireless applications, a number of datasets for wireless applications, performance measures for general classifiers, a survey of the literature on GANs for i)-iii) above, and future research directions. As a use case of GAN for NextG communications, we show that a GAN can be effectively applied for anomaly detection in signal classification (e.g., user authentication) outperforming another state-of-the-art ML technique such as an autoencoder.
    Understanding the Performance of Knowledge Graph Embeddings in Drug Discovery. (arXiv:2105.10488v3 [q-bio.BM] UPDATED)
    Knowledge Graphs (KG) and associated Knowledge Graph Embedding (KGE) models have recently begun to be explored in the context of drug discovery and have the potential to assist in key challenges such as target identification. In the drug discovery domain, KGs can be employed as part of a process which can result in lab-based experiments being performed, or impact on other decisions, incurring significant time and financial costs and most importantly, ultimately influencing patient healthcare. For KGE models to have impact in this domain, a better understanding of not only of performance, but also the various factors which determine it, is required. In this study we investigate, over the course of many thousands of experiments, the predictive performance of five KGE models on two public drug discovery-oriented KGs. Our goal is not to focus on the best overall model or configuration, instead we take a deeper look at how performance can be affected by changes in the training setup, choice of hyperparameters, model parameter initialisation seed and different splits of the datasets. Our results highlight that these factors have significant impact on performance and can even affect the ranking of models. Indeed these factors should be reported along with model architectures to ensure complete reproducibility and fair comparisons of future work, and we argue this is critical for the acceptance of use, and impact of KGEs in a biomedical setting.
    MAF-GNN: Multi-adaptive Spatiotemporal-flow Graph Neural Network for Traffic Speed Forecasting. (arXiv:2108.03594v2 [cs.LG] UPDATED)
    Traffic forecasting is a core element of intelligent traffic monitoring system. Approaches based on graph neural networks have been widely used in this task to effectively capture spatial and temporal dependencies of road networks. However, these approaches can not effectively define the complicated network topology. Besides, their cascade network structures have limitations in transmitting distinct features in the time and space dimensions. In this paper, we propose a Multi-adaptive Spatiotemporal-flow Graph Neural Network (MAF-GNN) for traffic speed forecasting. MAF-GNN introduces an effective Multi-adaptive Adjacency Matrices Mechanism to capture multiple latent spatial dependencies between traffic nodes. Additionally, we propose Spatiotemporal-flow Modules aiming to further enhance feature propagation in both time and space dimensions. MAF-GNN achieves better performance than other models on two real-world datasets of public traffic network, METR-LA and PeMS-Bay, demonstrating the effectiveness of the proposed approach.
    Simplest Streaming Trees. (arXiv:2110.08483v4 [cs.LG] UPDATED)
    Decision forests, including random forests and gradient boosting trees, remain the leading machine learning methods for many real-world data problems, specifically on tabular data. However, current standard implementations only operate in batch mode, and therefore cannot incrementally update when more data arrive. Several previous works developed streaming trees and ensembles to overcome this limitation. Nonetheless, we found that those state-of-the-art algorithms suffer from a number of drawbacks, including performing very poorly on some problems and requiring a huge amount of memory on others. We therefore developed the simplest possible extension of decision trees we could think of: given new data, simply update existing trees by continuing to grow them, and replace some old trees with new ones to control the total number of trees. On three standard datasets, we illustrate that our approach, Stream Decision Forest (SDF), does not suffer from either of the aforementioned limitations. In a benchmark suite containing 72 classification problems (the OpenML-CC18 data suite), we illustrate that our approach often performs as well, and sometimes better even, than the batch mode decision forest algorithm. Thus, we believe SDFs establish a simple standard for streaming trees and forests that could readily be applied to many real-world problems, including those with distribution drift and continual learning.
    Active Learning for Domain Adaptation: An Energy-Based Approach. (arXiv:2112.01406v3 [cs.LG] UPDATED)
    Unsupervised domain adaptation has recently emerged as an effective paradigm for generalizing deep neural networks to new target domains. However, there is still enormous potential to be tapped to reach the fully supervised performance. In this paper, we present a novel active learning strategy to assist knowledge transfer in the target domain, dubbed active domain adaptation. We start from an observation that energy-based models exhibit \textit{free energy biases} when training (source) and test (target) data come from different distributions. Inspired by this inherent mechanism, we empirically reveal that a simple yet efficient energy-based sampling strategy sheds light on selecting the most valuable target samples than existing approaches requiring particular architectures or computation of the distances. Our algorithm, Energy-based Active Domain Adaptation (EADA), queries groups of target data that incorporate both domain characteristic and instance uncertainty into every selection round. Meanwhile, by aligning the free energy of target data compact around the source domain via a regularization term, domain gap can be implicitly diminished. Through extensive experiments, we show that EADA surpasses state-of-the-art methods on well-known challenging benchmarks with substantial improvements, making it a useful option in the open world. Code is available at https://github.com/BIT-DA/EADA.
    Accelerated Jarzynski Estimator with Deterministic Virtual Trajectories. (arXiv:2103.00529v2 [cond-mat.stat-mech] UPDATED)
    The Jarzynski estimator is a powerful tool that uses nonequilibrium statistical physics to numerically obtain partition functions of probability distributions. The estimator reconstructs partition functions with trajectories of the simulated Langevin dynamics through the Jarzynski equality. However, the original estimator suffers from slow convergence because it depends on rare trajectories of stochastic dynamics. In this paper, we present a method to significantly accelerate the convergence by introducing deterministic virtual trajectories generated in augmented state space under the Hamiltonian dynamics. We theoretically show that our approach achieves second-order acceleration compared to a naive estimator with the Langevin dynamics and zero variance estimation on harmonic potentials. We also present numerical experiments on three multimodal distributions and a practical example where the proposed method outperforms the conventional method, and provide theoretical explanations.
    A review of deep learning methods for MRI reconstruction. (arXiv:2109.08618v2 [eess.IV] UPDATED)
    Following the success of deep learning in a wide range of applications, neural network-based machine-learning techniques have received significant interest for accelerating magnetic resonance imaging (MRI) acquisition and reconstruction strategies. A number of ideas inspired by deep learning techniques for computer vision and image processing have been successfully applied to nonlinear image reconstruction in the spirit of compressed sensing for accelerated MRI. Given the rapidly growing nature of the field, it is imperative to consolidate and summarize the large number of deep learning methods that have been reported in the literature, to obtain a better understanding of the field in general. This article provides an overview of the recent developments in neural-network based approaches that have been proposed specifically for improving parallel imaging. A general background and introduction to parallel MRI is also given from a classical view of k-space based reconstruction methods. Image domain based techniques that introduce improved regularizers are covered along with k-space based methods which focus on better interpolation strategies using neural networks. While the field is rapidly evolving with plenty of papers published each year, in this review, we attempt to cover broad categories of methods that have shown good performance on publicly available data sets. Limitations and open problems are also discussed and recent efforts for producing open data sets and benchmarks for the community are examined.
    Trajectory Prediction with Linguistic Representations. (arXiv:2110.09741v2 [cs.RO] UPDATED)
    Language allows humans to build mental models that interpret what is happening around them resulting in more accurate long-term predictions. We present a novel trajectory prediction model that uses linguistic intermediate representations to forecast trajectories, and is trained using trajectory samples with partially-annotated captions. The model learns the meaning of each of the words without direct per-word supervision. At inference time, it generates a linguistic description of trajectories which captures maneuvers and interactions over an extended time interval. This generated description is used to refine predictions of the trajectories of multiple agents. We train and validate our model on the Argoverse dataset, and demonstrate improved accuracy results in trajectory prediction. In addition, our model is more interpretable: it presents part of its reasoning in plain language as captions, which can aid model development and can aid in building confidence in the model before deploying it.
    Machine Learning based Optimal Feedback Control for Microgrid Stabilization. (arXiv:2203.04815v1 [eess.SY])
    Microgrids have more operational flexibilities as well as uncertainties than conventional power grids, especially when renewable energy resources are utilized. An energy storage based feedback controller can compensate undesired dynamics of a microgrid to improve its stability. However, the optimal feedback control of a microgrid subject to a large disturbance needs to solve a Hamilton-Jacobi-Bellman problem. This paper proposes a machine learning-based optimal feedback control scheme. Its training dataset is generated from a linear-quadratic regulator and a brute-force method respectively addressing small and large disturbances. Then, a three-layer neural network is constructed from the data for the purpose of optimal feedback control. A case study is carried out for a microgrid model based on a modified Kundur two-area system to test the real-time performance of the proposed control scheme.
    Leveling Down in Computer Vision: Pareto Inefficiencies in Fair Deep Classifiers. (arXiv:2203.04913v1 [cs.CV])
    Algorithmic fairness is frequently motivated in terms of a trade-off in which overall performance is decreased so as to improve performance on disadvantaged groups where the algorithm would otherwise be less accurate. Contrary to this, we find that applying existing fairness approaches to computer vision improve fairness by degrading the performance of classifiers across all groups (with increased degradation on the best performing groups). Extending the bias-variance decomposition for classification to fairness, we theoretically explain why the majority of fairness classifiers designed for low capacity models should not be used in settings involving high-capacity models, a scenario common to computer vision. We corroborate this analysis with extensive experimental support that shows that many of the fairness heuristics used in computer vision also degrade performance on the most disadvantaged groups. Building on these insights, we propose an adaptive augmentation strategy that, uniquely, of all methods tested, improves performance for the disadvantaged groups.
    Boundary Conditions for Linear Exit Time Gradient Trajectories Around Saddle Points: Analysis and Algorithm. (arXiv:2101.02625v2 [math.OC] UPDATED)
    Gradient-related first-order methods have become the workhorse of large-scale numerical optimization problems. Many of these problems involve nonconvex objective functions with multiple saddle points, which necessitates an understanding of the behavior of discrete trajectories of first-order methods within the geometrical landscape of these functions. This paper concerns convergence of first-order discrete methods to a local minimum of nonconvex optimization problems that comprise strict-saddle points within the geometrical landscape. To this end, it focuses on analysis of discrete gradient trajectories around saddle neighborhoods, derives sufficient conditions under which these trajectories can escape strict-saddle neighborhoods in linear time, explores the contractive and expansive dynamics of these trajectories in neighborhoods of strict-saddle points that are characterized by gradients of moderate magnitude, characterizes the non-curving nature of these trajectories, and highlights the inability of these trajectories to re-enter the neighborhoods around strict-saddle points after exiting them. Based on these insights and analyses, the paper then proposes a simple variant of the vanilla gradient descent algorithm, termed Curvature Conditioned Regularized Gradient Descent (CCRGD) algorithm, which utilizes a check for an initial boundary condition to ensure its trajectories can escape strict-saddle neighborhoods in linear time. Convergence analysis of the CCRGD algorithm, which includes its rate of convergence to a local minimum, is also presented in the paper. Numerical experiments are then provided on a test function as well as a low-rank matrix factorization problem to evaluate the efficacy of the proposed algorithm.
    Automatic Language Identification for Celtic Texts. (arXiv:2203.04831v1 [cs.CL])
    Language identification is an important Natural Language Processing task. It has been thoroughly researched in the literature. However, some issues are still open. This work addresses the identification of the related low-resource languages on the example of the Celtic language family. This work's main goals were: (1) to collect the dataset of three Celtic languages; (2) to prepare a method to identify the languages from the Celtic family, i.e. to train a successful classification model; (3) to evaluate the influence of different feature extraction methods, and explore the applicability of the unsupervised models as a feature extraction technique; (4) to experiment with the unsupervised feature extraction on a reduced annotated set. We collected a new dataset including Irish, Scottish, Welsh and English records. We tested supervised models such as SVM and neural networks with traditional statistical features alongside the output of clustering, autoencoder, and topic modelling methods. The analysis showed that the unsupervised features could serve as a valuable extension to the n-gram feature vectors. It led to an improvement in performance for more entangled classes. The best model achieved a 98\% F1 score and 97\% MCC. The dense neural network consistently outperformed the SVM model. The low-resource languages are also challenging due to the scarcity of available annotated training data. This work evaluated the performance of the classifiers using the unsupervised feature extraction on the reduced labelled dataset to handle this issue. The results uncovered that the unsupervised feature vectors are more robust to the labelled set reduction. Therefore, they proved to help achieve comparable classification performance with much less labelled data.
    Efficient Image Representation Learning with Federated Sampled Softmax. (arXiv:2203.04888v1 [cs.LG])
    Learning image representations on decentralized data can bring many benefits in cases where data cannot be aggregated across data silos. Softmax cross entropy loss is highly effective and commonly used for learning image representations. Using a large number of classes has proven to be particularly beneficial for the descriptive power of such representations in centralized learning. However, doing so on decentralized data with Federated Learning is not straightforward as the demand on FL clients' computation and communication increases proportionally to the number of classes. In this work we introduce federated sampled softmax (FedSS), a resource-efficient approach for learning image representation with Federated Learning. Specifically, the FL clients sample a set of classes and optimize only the corresponding model parameters with respect to a sampled softmax objective that approximates the global full softmax objective. We examine the loss formulation and empirically show that our method significantly reduces the number of parameters transferred to and optimized by the client devices, while performing on par with the standard full softmax method. This work creates a possibility for efficiently learning image representations on decentralized data with a large number of classes under the federated setting.
    A Survey on Reinforcement Learning Methods in Character Animation. (arXiv:2203.04735v1 [cs.GR])
    Reinforcement Learning is an area of Machine Learning focused on how agents can be trained to make sequential decisions, and achieve a particular goal within an arbitrary environment. While learning, they repeatedly take actions based on their observation of the environment, and receive appropriate rewards which define the objective. This experience is then used to progressively improve the policy controlling the agent's behavior, typically represented by a neural network. This trained module can then be reused for similar problems, which makes this approach promising for the animation of autonomous, yet reactive characters in simulators, video games or virtual reality environments. This paper surveys the modern Deep Reinforcement Learning methods and discusses their possible applications in Character Animation, from skeletal control of a single, physically-based character to navigation controllers for individual agents and virtual crowds. It also describes the practical side of training DRL systems, comparing the different frameworks available to build such agents.
    Free Probability, Newton lilypads and hyperbolicity of Jacobians as a solution to the problem of tuning the architecture of neural networks. (arXiv:2111.00841v2 [stat.ML] UPDATED)
    Gradient descent during the learning process of a neural network can be subject to many instabilities. The spectral density of the Jacobian is a key component for analyzing robustness. Following the works of Pennington et al., such Jacobians are modeled using free multiplicative convolutions from Free Probability Theory (FPT). We present a reliable and very fast method for computing the associated spectral densities. This method has a controlled and proven convergence. Our technique is based on an homotopy method: it is an adaptative Newton-Raphson scheme which chains basins of attraction. We find contiguous lilypad-like basins and step from one to the next, heading towards the objective. In order to demonstrate the applicability of our method we show that the relevant FPT metrics computed before training are highly correlated to final test losses -- up to $85 \%$. We also give evidence that a very desirable feature for neural networks is the hyperbolicity of their Jacobian at initialization.
    Downstream Fairness Caveats with Synthetic Healthcare Data. (arXiv:2203.04462v1 [cs.LG])
    This paper evaluates synthetically generated healthcare data for biases and investigates the effect of fairness mitigation techniques on utility-fairness. Privacy laws limit access to health data such as Electronic Medical Records (EMRs) to preserve patient privacy. Albeit essential, these laws hinder research reproducibility. Synthetic data is a viable solution that can enable access to data similar to real healthcare data without privacy risks. Healthcare datasets may have biases in which certain protected groups might experience worse outcomes than others. With the real data having biases, the fairness of synthetically generated health data comes into question. In this paper, we evaluate the fairness of models generated on two healthcare datasets for gender and race biases. We generate synthetic versions of the dataset using a Generative Adversarial Network called HealthGAN, and compare the real and synthetic model's balanced accuracy and fairness scores. We find that synthetic data has different fairness properties compared to real data and fairness mitigation techniques perform differently, highlighting that synthetic data is not bias free.
    The Fairness Field Guide: Perspectives from Social and Formal Sciences. (arXiv:2201.05216v2 [cs.AI] UPDATED)
    Over the past several years, a slew of different methods to measure the fairness of a machine learning model have been proposed. However, despite the growing number of publications and implementations, there is still a critical lack of literature that explains the interplay of fair machine learning with the social sciences of philosophy, sociology, and law. We hope to remedy this issue by accumulating and expounding upon the thoughts and discussions of fair machine learning produced by both social and formal (specifically machine learning and statistics) sciences in this field guide. Specifically, in addition to giving the mathematical and algorithmic backgrounds of several popular statistical and causal-based fair machine learning methods, we explain the underlying philosophical and legal thoughts that support them. Further, we explore several criticisms of the current approaches to fair machine learning from sociological and philosophical viewpoints. It is our hope that this field guide will help fair machine learning practitioners better understand how their algorithms align with important humanistic values (such as fairness) and how we can, as a field, design methods and metrics to better serve oppressed and marginalized populaces.  ( 2 min )
    Renyi Fair Information Bottleneck for Image Classification. (arXiv:2203.04950v1 [cs.LG])
    We develop a novel method for ensuring fairness in machine learning which we term as the Renyi Fair Information Bottleneck (RFIB). We consider two different fairness constraints - demographic parity and equalized odds - for learning fair representations and derive a loss function via a variational approach that uses Renyi's divergence with its tunable parameter $\alpha$ and that takes into account the triple constraints of utility, fairness, and compactness of representation. We then evaluate the performance of our method for image classification using the EyePACS medical imaging dataset, showing it outperforms competing state of the art techniques with performance measured using a variety of compound utility/fairness metrics, including accuracy gap and Rawls' minimal accuracy.  ( 2 min )
    Model-free feature selection to facilitate automatic discovery of divergent subgroups in tabular data. (arXiv:2203.04386v1 [cs.LG])
    Data-centric AI encourages the need of cleaning and understanding of data in order to achieve trustworthy AI. Existing technologies, such as AutoML, make it easier to design and train models automatically, but there is a lack of a similar level of capabilities to extract data-centric insights. Manual stratification of tabular data per a feature (e.g., gender) is limited to scale up for higher feature dimension, which could be addressed using automatic discovery of divergent subgroups. Nonetheless, these automatic discovery techniques often search across potentially exponential combinations of features that could be simplified using a preceding feature selection step. Existing feature selection techniques for tabular data often involve fitting a particular model in order to select important features. However, such model-based selection is prone to model-bias and spurious correlations in addition to requiring extra resource to design, fine-tune and train a model. In this paper, we propose a model-free and sparsity-based automatic feature selection (SAFS) framework to facilitate automatic discovery of divergent subgroups. Different from filter-based selection techniques, we exploit the sparsity of objective measures among feature values to rank and select features. We validated SAFS across two publicly available datasets (MIMIC-III and Allstate Claims) and compared it with six existing feature selection methods. SAFS achieves a reduction of feature selection time by a factor of 81x and 104x, averaged cross the existing methods in the MIMIC-III and Claims datasets respectively. SAFS-selected features are also shown to achieve competitive detection performance, e.g., 18.3% of features selected by SAFS in the Claims dataset detected divergent samples similar to those detected by using the whole features with a Jaccard similarity of 0.95 but with a 16x reduction in detection time.  ( 2 min )
    Disrupting Adversarial Transferability in Deep Neural Networks. (arXiv:2108.12492v2 [cs.LG] UPDATED)
    Adversarial attack transferability is well-recognized in deep learning. Prior work has partially explained transferability by recognizing common adversarial subspaces and correlations between decision boundaries, but little is known beyond this. We propose that transferability between seemingly different models is due to a high linear correlation between the feature sets that different networks extract. In other words, two models trained on the same task that are distant in the parameter space likely extract features in the same fashion, just with trivial affine transformations between the latent spaces. Furthermore, we show how applying a feature correlation loss, which decorrelates the extracted features in a latent space, can reduce the transferability of adversarial attacks between models, suggesting that the models complete tasks in semantically different ways. Finally, we propose a Dual Neck Autoencoder (DNA), which leverages this feature correlation loss to create two meaningfully different encodings of input information with reduced transferability.  ( 2 min )
    Deep Learning for Sleep Stages Classification: Modified Rectified Linear Unit Activation Function and Modified Orthogonal Weight Initialisation. (arXiv:2203.04371v1 [eess.SP])
    Background and Aim: Each stage of sleep can affect human health, and not getting enough sleep at any stage may lead to sleep disorder like parasomnia, apnea, insomnia, etc. Sleep-related diseases could be diagnosed using Convolutional Neural Network Classifier. However, this classifier has not been successfully implemented into sleep stage classification systems due to high complexity and low accuracy of classification. The aim of this research is to increase the accuracy and reduce the learning time of Convolutional Neural Network Classifier. Methodology: The proposed system used a modified Orthogonal Convolutional Neural Network and a modified Adam optimisation technique to improve the sleep stage classification accuracy and reduce the gradient saturation problem that occurs due to sigmoid activation function. The proposed system uses Leaky Rectified Linear Unit (ReLU) instead of sigmoid activation function as an activation function. Results: The proposed system called Enhanced Sleep Stage Classification system (ESSC) used six different databases for training and testing the proposed model on the different sleep stages. These databases are University College Dublin database (UCD), Beth Israel Deaconess Medical Center MIT database (MIT-BIH), Sleep European Data Format (EDF), Sleep EDF Extended, Montreal Archive of Sleep Studies (MASS), and Sleep Heart Health Study (SHHS). Our results show that the gradient saturation problem does not exist anymore. The modified Adam optimiser helps to reduce the noise which in turn result in faster convergence time. Conclusion: The convergence speed of ESSC is increased along with better classification accuracy compared to the state of art solution.  ( 2 min )
    Learning low-degree functions from a logarithmic number of random queries. (arXiv:2109.10162v3 [cs.LG] UPDATED)
    We prove that every bounded function $f:\{-1,1\}^n\to[-1,1]$ of degree at most $d$ can be learned with $L_2$-accuracy $\varepsilon$ and confidence $1-\delta$ from $\log(\tfrac{n}{\delta})\,\varepsilon^{-d-1} C^{d^{3/2}\sqrt{\log d}}$ random queries, where $C>1$ is a universal finite constant.  ( 2 min )
    Autoregressive based Drift Detection Method. (arXiv:2203.04769v1 [stat.ML])
    In the classic machine learning framework, models are trained on historical data and used to predict future values. It is assumed that the data distribution does not change over time (stationarity). However, in real-world scenarios, the data generation process changes over time and the model has to adapt to the new incoming data. This phenomenon is known as concept drift and leads to a decrease in the predictive model's performance. In this study, we propose a new concept drift detection method based on autoregressive models called ADDM. This method can be integrated into any machine learning algorithm from deep neural networks to simple linear regression model. Our results show that this new concept drift detection method outperforms the state-of-the-art drift detection methods, both on synthetic data sets and real-world data sets. Our approach is theoretically guaranteed as well as empirical and effective for the detection of various concept drifts. In addition to the drift detector, we proposed a new method of concept drift adaptation based on the severity of the drift.  ( 2 min )
    Optimal approximation for unconstrained non-submodular minimization. (arXiv:1905.12145v4 [cs.LG] UPDATED)
    Submodular function minimization is well studied, and existing algorithms solve it exactly or up to arbitrary accuracy. However, in many applications, such as structured sparse learning or batch Bayesian optimization, the objective function is not exactly submodular, but close. In this case, no theoretical guarantees exist. Indeed, submodular minimization algorithms rely on intricate connections between submodularity and convexity. We show how these relations can be extended to obtain approximation guarantees for minimizing non-submodular functions, characterized by how close the function is to submodular. We also extend this result to noisy function evaluations. Our approximation results are the first for minimizing non-submodular functions, and are optimal, as established by our matching lower bound.
    CatGCN: Graph Convolutional Networks with Categorical Node Features. (arXiv:2009.05303v4 [cs.LG] UPDATED)
    Recent studies on Graph Convolutional Networks (GCNs) reveal that the initial node representations (i.e., the node representations before the first-time graph convolution) largely affect the final model performance. However, when learning the initial representation for a node, most existing work linearly combines the embeddings of node features, without considering the interactions among the features (or feature embeddings). We argue that when the node features are categorical, e.g., in many real-world applications like user profiling and recommender system, feature interactions usually carry important signals for predictive analytics. Ignoring them will result in suboptimal initial node representation and thus weaken the effectiveness of the follow-up graph convolution. In this paper, we propose a new GCN model named CatGCN, which is tailored for graph learning when the node features are categorical. Specifically, we integrate two ways of explicit interaction modeling into the learning of initial node representation, i.e., local interaction modeling on each pair of node features and global interaction modeling on an artificial feature graph. We then refine the enhanced initial node representations with the neighborhood aggregation-based graph convolution. We train CatGCN in an end-to-end fashion and demonstrate it on semi-supervised node classification. Extensive experiments on three tasks of user profiling (the prediction of user age, city, and purchase level) from Tencent and Alibaba datasets validate the effectiveness of CatGCN, especially the positive effect of performing feature interaction modeling before graph convolution.
    Survival Prediction of Brain Cancer with Incomplete Radiology, Pathology, Genomics, and Demographic Data. (arXiv:2203.04419v1 [cs.LG])
    Integrating cross-department multi-modal data (e.g., radiological, pathological, genomic, and clinical data) is ubiquitous in brain cancer diagnosis and survival prediction. To date, such an integration is typically conducted by human physicians (and panels of experts), which can be subjective and semi-quantitative. Recent advances in multi-modal deep learning, however, have opened a door to leverage such a process to a more objective and quantitative manner. Unfortunately, the prior arts of using four modalities on brain cancer survival prediction are limited by a "complete modalities" setting (i.e., with all modalities available). Thus, there are still open questions on how to effectively predict brain cancer survival from the incomplete radiological, pathological, genomic, and demographic data (e.g., one or more modalities might not be collected for a patient). For instance, should we use both complete and incomplete data, and more importantly, how to use those data? To answer the preceding questions, we generalize the multi-modal learning on cross-department multi-modal data to a missing data setting. Our contribution is three-fold: 1) We introduce optimal multi-modal learning with missing data (MMD) pipeline with optimized hardware consumption and computational efficiency; 2) We extend multi-modal learning on radiological, pathological, genomic, and demographic data into missing data scenarios; 3) a large-scale public dataset (with 962 patients) is collected to systematically evaluate glioma tumor survival prediction using four modalities. The proposed method improved the C-index of survival prediction from 0.7624 to 0.8053.
    Efficient Kirszbraun Extension with Applications to Regression. (arXiv:1905.11930v3 [cs.LG] UPDATED)
    We introduce a framework for performing regression between two Hilbert spaces. This is done based on Kirszbraun's extension theorem, to the best of our knowledge, the first application of this technique to supervised learning. We analyze the statistical and computational aspects of this method. We decompose this task into two stages: training (which corresponds operationally to smoothing/regularization) and prediction (which is achieved via Kirszbraun extension). Both are solved algorithmically via a novel multiplicative weight updates (MWU) scheme, which, for our problem formulation, achieves a quadratic runtime improvement over the state of the art. Our empirical results indicate a dramatic improvement over standard off-the-shelf solvers in our setting.
    IH-GAN: A Conditional Generative Model for Implicit Surface-Based Inverse Design of Cellular Structures. (arXiv:2103.02588v3 [cs.CE] UPDATED)
    Variable-density cellular structures can overcome connectivity and manufacturability issues of topologically optimized structures, particularly those represented as discrete density maps. However, the optimization of such cellular structures is challenging due to the multiscale design problem. Past work addressing this problem generally either only optimizes the volume fraction of single-type unit cells but ignoring the effects of unit cell geometry on properties, or considers the geometry-property relation but builds this relation via heuristics. In contrast, we propose a simple yet more principled way to accurately model the property to geometry mapping using a conditional deep generative model, named Inverse Homogenization Generative Adversarial Network (IH-GAN). It learns the conditional distribution of unit cell geometries given properties and can realize the one-to-many mapping from geometry to properties. We further reduce the complexity of IH-GAN by using the implicit function parameterization to represent unit cell geometries. Results show that our method can 1) generate various unit cells that satisfy given material properties with high accuracy (relative error <5%) and 2) improve the optimized structural performance over the conventional topology-optimized variable-density structure. Specifically, in the minimum compliance example, our IH-GAN generated structure achieves an 84.4% reduction in concentrated stress and an extra 7% reduction in displacement. In the target deformation examples, our IH-GAN generated structure reduces the target matching error by 24.2% and 44.4% for two test cases, respectively. We also demonstrated that the connectivity issue for multi-type unit cells can be solved by transition layer blending.
    Bitcoin Transaction Forecasting with Deep Network Representation Learning. (arXiv:2007.07993v2 [cs.SI] UPDATED)
    Bitcoin and its decentralized computing paradigm for digital currency trading are one of the most disruptive technology in the 21st century. This paper presents a novel approach to developing a Bitcoin transaction forecast model, DLForecast, by leveraging deep neural networks for learning Bitcoin transaction network representations. DLForecast makes three original contributions. First, we explore three interesting properties between Bitcoin transaction accounts: topological connectivity pattern of Bitcoin accounts, transaction amount pattern, and transaction dynamics. Second, we construct a time-decaying reachability graph and a time-decaying transaction pattern graph, aiming at capturing different types of spatial-temporal Bitcoin transaction patterns. Third, we employ node embedding on both graphs and develop a Bitcoin transaction forecasting system between user accounts based on historical transactions with built-in time-decaying factor. To maintain an effective transaction forecasting performance, we leverage the multiplicative model update (MMU) ensemble to combine prediction models built on different transaction features extracted from each corresponding Bitcoin transaction graph. Evaluated on real-world Bitcoin transaction data, we show that our spatial-temporal forecasting model is efficient with fast runtime and effective with forecasting accuracy over 60\% and improves the prediction performance by 50\% when compared to forecasting model built on the static graph baseline.
    Leveraging Online Shopping Behaviors as a Proxy for Personal Lifestyle Choices: New Insights into Chronic Disease Prevention Literacy. (arXiv:2104.14281v5 [cs.CY] UPDATED)
    Objective Ubiquitous internet access is reshaping the way we live, but it is accompanied by unprecedented challenges in preventing chronic diseases that are usually planted by long exposure to unhealthy lifestyles. This paper proposes leveraging online shopping behaviors as a proxy for personal lifestyle choices to improve chronic disease prevention literacy, targeted for times when e-commerce user experience has been assimilated into most people's everyday lives. Methods: Longitudinal query logs and purchase records from 15 million online shoppers were accessed, constructing a broad spectrum of lifestyle features covering various product categories and buyer personas. Using the lifestyle-related information preceding online shoppers' first purchases of specific prescription drugs, we could determine associations between their past lifestyle choices and whether they suffered from a particular chronic disease. Results: Novel lifestyle risk factors were discovered in two exemplars--depression and type 2 diabetes, most of which showed reasonable consistency with existing healthcare knowledge. Further, such empirical findings could be adopted to locate online shoppers at higher risk of these chronic diseases with decent accuracy [i.e., (area under the receiver operating characteristic curve) AUC=0.68 for depression and AUC=0.70 for type 2 diabetes], closely matching the performance of screening surveys benchmarked against medical diagnosis. Conclusions: Mining online shopping behaviors can point medical experts to a series of lifestyle issues associated with chronic diseases that are less explored to date. Hopefully, unobtrusive chronic disease surveillance via e-commerce sites can grant consenting individuals a privilege to be connected more readily with the medical profession and sophistication.
    Embedding Temporal Convolutional Networks for Energy-Efficient PPG-Based Heart Rate Monitoring. (arXiv:2203.04396v1 [eess.SP])
    Photoplethysmography (PPG) sensors allow for non-invasive and comfortable heart-rate (HR) monitoring, suitable for compact wrist-worn devices. Unfortunately, Motion Artifacts (MAs) severely impact the monitoring accuracy, causing high variability in the skin-to-sensor interface. Several data fusion techniques have been introduced to cope with this problem, based on combining PPG signals with inertial sensor data. Until know, both commercial and reasearch solutions are computationally efficient but not very robust, or strongly dependent on hand-tuned parameters, which leads to poor generalization performance. % In this work, we tackle these limitations by proposing a computationally lightweight yet robust deep learning-based approach for PPG-based HR estimation. Specifically, we derive a diverse set of Temporal Convolutional Networks (TCN) for HR estimation, leveraging Neural Architecture Search (NAS). Moreover, we also introduce ActPPG, an adaptive algorithm that selects among multiple HR estimators depending on the amount of MAs, to improve energy efficiency. We validate our approaches on two benchmark datasets, achieving as low as 3.84 Beats per Minute (BPM) of Mean Absolute Error (MAE) on PPGDalia, which outperforms the previous state-of-the-art. Moreover, we deploy our models on a low-power commercial microcontroller (STM32L4), obtaining a rich set of Pareto optimal solutions in the complexity vs. accuracy space.
    Policy Optimization in Dynamic Bayesian Network Hybrid Models of Biomanufacturing Processes. (arXiv:2105.06543v2 [cs.AI] UPDATED)
    Biopharmaceutical manufacturing is a rapidly growing industry with impact in virtually all branches of medicines. Biomanufacturing processes require close monitoring and control, in the presence of complex bioprocess dynamics with many interdependent factors, as well as extremely limited data due to the high cost of experiments as well as the novelty of personalized bio-drugs. We develop a novel model-based reinforcement learning framework that can achieve human-level control in low-data environments. The model uses a dynamic Bayesian network to capture causal interdependencies between factors and predict how the effects of different inputs propagate through the pathways of the bioprocess mechanisms. This enables the design of process control policies that are both interpretable and robust against model risk. We present a computationally efficient, provably convergence stochastic gradient method for optimizing such policies. Validation is conducted on a realistic application with a multi-dimensional, continuous state variable.
    Deep Generative Models for Geometric Design Under Uncertainty. (arXiv:2112.08919v2 [cs.LG] UPDATED)
    Deep generative models have demonstrated effectiveness in learning compact and expressive design representations that significantly improve geometric design optimization. However, these models do not consider the uncertainty introduced by manufacturing or fabrication. Past work that quantifies such uncertainty often makes simplified assumptions on geometric variations, while the "real-world" uncertainty and its impact on design performance are difficult to quantify due to the high dimensionality. To address this issue, we propose a Generative Adversarial Network-based Design under Uncertainty Framework (GAN-DUF), which contains a deep generative model that simultaneously learns a compact representation of nominal (ideal) designs and the conditional distribution of fabricated designs given any nominal design. We demonstrated the framework on two real-world engineering design examples and showed its capability of finding the solution that possesses better performances after fabrication.  ( 2 min )
    Deep learning-based reconstruction of highly accelerated 3D MRI. (arXiv:2203.04674v1 [eess.IV])
    Purpose: To accelerate brain 3D MRI scans by using a deep learning method for reconstructing images from highly-undersampled multi-coil k-space data Methods: DL-Speed, an unrolled optimization architecture with dense skip-layer connections, was trained on 3D T1-weighted brain scan data to reconstruct complex-valued images from highly-undersampled k-space data. The trained model was evaluated on 3D MPRAGE brain scan data retrospectively-undersampled with a 10-fold acceleration, compared to a conventional parallel imaging method with a 2-fold acceleration. Scores of SNR, artifacts, gray/white matter contrast, resolution/sharpness, deep gray-matter, cerebellar vermis, anterior commissure, and overall quality, on a 5-point Likert scale, were assessed by experienced radiologists. In addition, the trained model was tested on retrospectively-undersampled 3D T1-weighted LAVA (Liver Acquisition with Volume Acceleration) abdominal scan data, and prospectively-undersampled 3D MPRAGE and LAVA scans in three healthy volunteers and one, respectively. Results: The qualitative scores for DL-Speed with a 10-fold acceleration were higher than or equal to those for the parallel imaging with 2-fold acceleration. DL-Speed outperformed a compressed sensing method in quantitative metrics on retrospectively-undersampled LAVA data. DL-Speed was demonstrated to perform reasonably well on prospectively-undersampled scan data, realizing a 2-5 times reduction in scan time. Conclusion: DL-Speed was shown to accelerate 3D MPRAGE and LAVA with up to a net 10-fold acceleration, achieving 2-5 times faster scans compared to conventional parallel imaging and acceleration, while maintaining diagnostic image quality and real-time reconstruction. The brain scan data-trained DL-Speed also performed well when reconstructing abdominal LAVA scan data, demonstrating versatility of the network.  ( 2 min )
    Two Wrongs Don't Make a Right: Combating Confirmation Bias in Learning with Label Noise. (arXiv:2112.02960v2 [cs.LG] UPDATED)
    Noisy labels damage the performance of deep networks. For robust learning, a prominent two-stage pipeline alternates between eliminating possible incorrect labels and semi-supervised training. However, discarding part of noisy labels could result in a loss of information, especially when the corruption is not completely random, e.g., class-dependent or instance-dependent. Moreover, from the training dynamics of a representative two-stage method DivideMix, we identify the domination of confirmation bias: pseudo-labels fail to correct a considerable amount of noisy labels, and consequently, the errors accumulate. To sufficiently exploit information from noisy labels and mitigate wrong corrections, we propose Robust Label Refurbishment (Robust LR) a new hybrid method that integrates pseudo-labeling and confidence estimation techniques to refurbish noisy labels. We show that our method successfully alleviates the damage of both label noise and confirmation bias. As a result, it achieves state-of-the-art results across datasets and noise types. For example, Robust LR achieves up to 4.5% absolute top-1 accuracy improvement over the previous best on the real-world noisy dataset WebVision.  ( 2 min )
    Why Interpretable Causal Inference is Important for High-Stakes Decision Making for Critically Ill Patients and How To Do It. (arXiv:2203.04920v1 [stat.ME])
    Many fundamental problems affecting the care of critically ill patients lead to similar analytical challenges: physicians cannot easily estimate the effects of at-risk medical conditions or treatments because the causal effects of medical conditions and drugs are entangled. They also cannot easily perform studies: there are not enough high-quality data for high-dimensional observational causal inference, and RCTs often cannot ethically be conducted. However, mechanistic knowledge is available, including how drugs are absorbed into the body, and the combination of this knowledge with the limited data could potentially suffice -- if we knew how to combine them. In this work, we present a framework for interpretable estimation of causal effects for critically ill patients under exactly these complex conditions: interactions between drugs and observations over time, patient data sets that are not large, and mechanistic knowledge that can substitute for lack of data. We apply this framework to an extremely important problem affecting critically ill patients, namely the effect of seizures and other potentially harmful electrical events in the brain (called epileptiform activity -- EA) on outcomes. Given the high stakes involved and the high noise in the data, interpretability is critical for troubleshooting such complex problems. Interpretability of our matched groups allowed neurologists to perform chart reviews to verify the quality of our causal analysis. For instance, our work indicates that a patient who experiences a high level of seizure-like activity (75% high EA burden) and is untreated for a six-hour window, has, on average, a 16.7% increased chance of adverse outcomes such as severe brain damage, lifetime disability, or death. We find that patients with mild but long-lasting EA (average EA burden >= 50%) have their risk of an adverse outcome increased by 11.2%.
    Binary Classification Under $\ell_0$ Attacks for General Noise Distribution. (arXiv:2203.04855v1 [cs.LG])
    Adversarial examples have recently drawn considerable attention in the field of machine learning due to the fact that small perturbations in the data can result in major performance degradation. This phenomenon is usually modeled by a malicious adversary that can apply perturbations to the data in a constrained fashion, such as being bounded in a certain norm. In this paper, we study this problem when the adversary is constrained by the $\ell_0$ norm; i.e., it can perturb a certain number of coordinates in the input, but has no limit on how much it can perturb those coordinates. Due to the combinatorial nature of this setting, we need to go beyond the standard techniques in robust machine learning to address this problem. We consider a binary classification scenario where $d$ noisy data samples of the true label are provided to us after adversarial perturbations. We introduce a classification method which employs a nonlinear component called truncation, and show in an asymptotic scenario, as long as the adversary is restricted to perturb no more than $\sqrt{d}$ data samples, we can almost achieve the optimal classification error in the absence of the adversary, i.e. we can completely neutralize adversary's effect. Surprisingly, we observe a phase transition in the sense that using a converse argument, we show that if the adversary can perturb more than $\sqrt{d}$ coordinates, no classifier can do better than a random guess.  ( 2 min )
    Contextual Networks and Unsupervised Ranking of Sentences. (arXiv:2203.04459v1 [cs.CL])
    We construct a contextual network to represent a document with syntactic and semantic relations between word-sentence pairs, based on which we devise an unsupervised algorithm called CNATAR (Contextual Network And Text Analysis Rank) to score sentences, and rank them through a bi-objective 0-1 knapsack maximization problem over topic analysis and sentence scores. We show that CNATAR outperforms the combined ranking of the three human judges provided on the SummBank dataset under both ROUGE and BLEU metrics, which in term significantly outperforms each individual judge's ranking. Moreover, CNATAR produces so far the highest ROUGE scores over DUC-02, and outperforms previous supervised algorithms on the CNN/DailyMail and NYT datasets. We also compare the performance of CNATAR and the latest supervised neural-network summarization models and compute oracle results.
    Temporal Difference Learning for Model Predictive Control. (arXiv:2203.04955v1 [cs.LG])
    Data-driven model predictive control has two key advantages over model-free methods: a potential for improved sample efficiency through model learning, and better performance as computational budget for planning increases. However, it is both costly to plan over long horizons and challenging to obtain an accurate model of the environment. In this work, we combine the strengths of model-free and model-based methods. We use a learned task-oriented latent dynamics model for local trajectory optimization over a short horizon, and use a learned terminal value function to estimate long-term return, both of which are learned jointly by temporal difference learning. Our method, TD-MPC, achieves superior sample efficiency and asymptotic performance over prior work on both state and image-based continuous control tasks from DMControl and Meta-World. Code and video results are available at https://nicklashansen.github.io/td-mpc.  ( 2 min )
    Robust Federated Learning Against Adversarial Attacks for Speech Emotion Recognition. (arXiv:2203.04696v1 [cs.SD])
    Due to the development of machine learning and speech processing, speech emotion recognition has been a popular research topic in recent years. However, the speech data cannot be protected when it is uploaded and processed on servers in the internet-of-things applications of speech emotion recognition. Furthermore, deep neural networks have proven to be vulnerable to human-indistinguishable adversarial perturbations. The adversarial attacks generated from the perturbations may result in deep neural networks wrongly predicting the emotional states. We propose a novel federated adversarial learning framework for protecting both data and deep neural networks. The proposed framework consists of i) federated learning for data privacy, and ii) adversarial training at the training stage and randomisation at the testing stage for model robustness. The experiments show that our proposed framework can effectively protect the speech data locally and improve the model robustness against a series of adversarial attacks.
    Logic-based AI for Interpretable Board Game Winner Prediction with Tsetlin Machine. (arXiv:2203.04378v1 [cs.AI])
    Hex is a turn-based two-player connection game with a high branching factor, making the game arbitrarily complex with increasing board sizes. As such, top-performing algorithms for playing Hex rely on accurate evaluation of board positions using neural networks. However, the limited interpretability of neural networks is problematic when the user wants to understand the reasoning behind the predictions made. In this paper, we propose to use propositional logic expressions to describe winning and losing board game positions, facilitating precise visual interpretation. We employ a Tsetlin Machine (TM) to learn these expressions from previously played games, describing where pieces must be located or not located for a board position to be strong. Extensive experiments on $6\times6$ boards compare our TM-based solution with popular machine learning algorithms like XGBoost, InterpretML, decision trees, and neural networks, considering various board configurations with $2$ to $22$ moves played. On average, the TM testing accuracy is $92.1\%$, outperforming all the other evaluated algorithms. We further demonstrate the global interpretation of the logical expressions and map them down to particular board game configurations to investigate local interpretability. We believe the resulting interpretability establishes building blocks for accurate assistive AI and human-AI collaboration, also for more complex prediction tasks.  ( 2 min )
    Efficient Sub-structured Knowledge Distillation. (arXiv:2203.04825v1 [cs.LG])
    Structured prediction models aim at solving a type of problem where the output is a complex structure, rather than a single variable. Performing knowledge distillation for such models is not trivial due to their exponentially large output space. In this work, we propose an approach that is much simpler in its formulation and far more efficient for training than existing approaches. Specifically, we transfer the knowledge from a teacher model to its student model by locally matching their predictions on all sub-structures, instead of the whole output space. In this manner, we avoid adopting some time-consuming techniques like dynamic programming (DP) for decoding output structures, which permits parallel computation and makes the training process even faster in practice. Besides, it encourages the student model to better mimic the internal behavior of the teacher model. Experiments on two structured prediction tasks demonstrate that our approach outperforms previous methods and halves the time cost for one training epoch.
    A Flexible Multi-Task Model for BERT Serving. (arXiv:2107.05377v2 [cs.CL] UPDATED)
    In this demonstration, we present an efficient BERT-based multi-task (MT) framework that is particularly suitable for iterative and incremental development of the tasks. The proposed framework is based on the idea of partial fine-tuning, i.e. only fine-tune some top layers of BERT while keep the other layers frozen. For each task, we train independently a single-task (ST) model using partial fine-tuning. Then we compress the task-specific layers in each ST model using knowledge distillation. Those compressed ST models are finally merged into one MT model so that the frozen layers of the former are shared across the tasks. We exemplify our approach on eight GLUE tasks, demonstrating that it is able to achieve both strong performance and efficiency. We have implemented our method in the utterance understanding system of XiaoAI, a commercial AI assistant developed by Xiaomi. We estimate that our model reduces the overall serving cost by 86%.  ( 2 min )
    Beam Search for Feature Selection. (arXiv:2203.04350v1 [cs.LG])
    In this paper, we present and prove some consistency results about the performance of classification models using a subset of features. In addition, we propose to use beam search to perform feature selection, which can be viewed as a generalization of forward selection. We apply beam search to both simulated and real-world data, by evaluating and comparing the performance of different classification models using different sets of features. The results demonstrate that beam search could outperform forward selection, especially when the features are correlated so that they have more discriminative power when considered jointly than individually. Moreover, in some cases classification models could obtain comparable performance using only ten features selected by beam search instead of hundreds of original features.
    Avoiding C-hacking when evaluating survival distribution predictions with discrimination measures. (arXiv:2112.04828v2 [stat.ML] UPDATED)
    In this paper we consider how to evaluate survival distribution predictions with measures of discrimination. This is a non-trivial problem as discrimination measures are the most commonly used in survival analysis and yet there is no clear method to derive a risk prediction from a distribution prediction. We survey methods proposed in literature and software and consider their respective advantages and disadvantages. Whilst distributions are frequently evaluated by discrimination measures, we find that the method for doing so is rarely described in the literature and often leads to unfair comparisons. We find that the most robust method of reducing a distribution to a risk is to sum over the predicted cumulative hazard. We recommend that machine learning survival analysis software implements clear transformations between distribution and risk predictions in order to allow more transparent and accessible model evaluation. The code used in the final experiment is available at https://github.com/RaphaelS1/distribution_discrimination.
    A Brain-Inspired Low-Dimensional Computing Classifier for Inference on Tiny Devices. (arXiv:2203.04894v1 [cs.LG])
    By mimicking brain-like cognition and exploiting parallelism, hyperdimensional computing (HDC) classifiers have been emerging as a lightweight framework to achieve efficient on-device inference. Nonetheless, they have two fundamental drawbacks, heuristic training process and ultra-high dimension, which result in sub-optimal inference accuracy and large model sizes beyond the capability of tiny devices with stringent resource constraints. In this paper, we address these fundamental drawbacks and propose a low-dimensional computing (LDC) alternative. Specifically, by mapping our LDC classifier into an equivalent neural network, we optimize our model using a principled training approach. Most importantly, we can improve the inference accuracy while successfully reducing the ultra-high dimension of existing HDC models by orders of magnitude (e.g., 8000 vs. 4/64). We run experiments to evaluate our LDC classifier by considering different datasets for inference on tiny devices, and also implement different models on an FPGA platform for acceleration. The results highlight that our LDC classifier offers an overwhelming advantage over the existing brain-inspired HDC models and is particularly suitable for inference on tiny devices.
    Deep Learning Estimation of Absorbed Dose for Nuclear Medicine Diagnostics. (arXiv:1805.09108v10 [stat.ML] UPDATED)
    The distribution of energy dose from Lu$^{177}$ radiotherapy can be estimated by convolving an image of a time-integrated activity distribution with a dose voxel kernel (DVK) consisting of different types of tissues. This fast and inacurate approximation is inappropriate for personalized dosimetry as it neglects tissue heterogenity. The latter can be calculated using different imaging techniques such as CT and SPECT combined with a time consuming monte-carlo simulation. The aim of this study is, for the first time, an estimation of DVKs from CT-derived density kernels (DK) via deep learning in convolutional neural networks (CNNs). The proposed CNN achieved, on the test set, a mean intersection over union (IOU) of $= 0.86$ after $308$ epochs and a corresponding mean squared error (MSE) $= 1.24 \cdot 10^{-4}$. This generalization ability shows that the trained CNN can indeed learn the difficult transfer function from DK to DVK. Future work will evaluate DVKs estimated by CNNs with full monte-carlo simulations of a whole body CT to predict patient specific voxel dose maps.
    Parallel Training of GRU Networks with a Multi-Grid Solver for Long Sequences. (arXiv:2203.04738v1 [cs.CV])
    Parallelizing Gated Recurrent Unit (GRU) networks is a challenging task, as the training procedure of GRU is inherently sequential. Prior efforts to parallelize GRU have largely focused on conventional parallelization strategies such as data-parallel and model-parallel training algorithms. However, when the given sequences are very long, existing approaches are still inevitably performance limited in terms of training time. In this paper, we present a novel parallel training scheme (called parallel-in-time) for GRU based on a multigrid reduction in time (MGRIT) solver. MGRIT partitions a sequence into multiple shorter sub-sequences and trains the sub-sequences on different processors in parallel. The key to achieving speedup is a hierarchical correction of the hidden state to accelerate end-to-end communication in both the forward and backward propagation phases of gradient descent. Experimental results on the HMDB51 dataset, where each video is an image sequence, demonstrate that the new parallel training scheme achieves up to 6.5$\times$ speedup over a serial approach. As efficiency of our new parallelization strategy is associated with the sequence length, our parallel GRU algorithm achieves significant performance improvement as the sequence length increases.
    Geometric Optimisation on Manifolds with Applications to Deep Learning. (arXiv:2203.04794v1 [math.OC])
    We design and implement a Python library to help the non-expert using all these powerful tools in a way that is efficient, extensible, and simple to incorporate into the workflow of the data scientist, practitioner, and applied researcher. The algorithms implemented in this library have been designed with usability and GPU efficiency in mind, and they can be added to any PyTorch model with just one extra line of code. We showcase the effectiveness of these tools on an application of optimisation on manifolds in the setting of time series analysis. In this setting, orthogonal and unitary optimisation is used to constraint and regularise recurrent models and avoid vanishing and exploding gradient problems. The algorithms designed for GeoTorch allow us to achieve state of the art results in the standard tests for this family of models. We use tools from comparison geometry to give bounds on quantities that are of interest in optimisation problems. In particular, we build on the work of (Kaul 1976) to give explicit bounds on the norm of the second derivative of the Riemannian exponential.  ( 2 min )
    QFold: Quantum Walks and Deep Learning to Solve Protein Folding. (arXiv:2101.10279v2 [quant-ph] UPDATED)
    Predicting the 3D structure of proteins is one of the most important problems in current biochemical research. In this article, we explain how to combine recent deep learning advances with the well known technique of quantum walks applied to a Metropolis algorithm. The result, QFold, is a fully scalable hybrid quantum algorithm that, in contrast to previous quantum approaches, does not require a lattice model simplification and instead relies on the much more realistic assumption of parameterization in terms of torsion angles of the amino acids. We compare it with its classical analog for different annealing schedules and find a polynomial quantum advantage, and implement a minimal realization of the quantum Metropolis in the IBMQ Casablanca quantum system.  ( 2 min )
    Monitoring Time Series With Missing Values: a Deep Probabilistic Approach. (arXiv:2203.04916v1 [stat.ML])
    Systems are commonly monitored for health and security through collection and streaming of multivariate time series. Advances in time series forecasting due to adoption of multilayer recurrent neural network architectures make it possible to forecast in high-dimensional time series, and identify and classify novelties early, based on subtle changes in the trends. However, mainstream approaches to multi-variate time series predictions do not handle well cases when the ongoing forecast must include uncertainty, nor they are robust to missing data. We introduce a new architecture for time series monitoring based on combination of state-of-the-art methods of forecasting in high-dimensional time series with full probabilistic handling of uncertainty. We demonstrate advantage of the architecture for time series forecasting and novelty detection, in particular with partially missing data, and empirically evaluate and compare the architecture to state-of-the-art approaches on a real-world data set.  ( 2 min )
    Convergence of the Deep BSDE Method for Coupled FBSDEs. (arXiv:1811.01165v4 [math.PR] UPDATED)
    The recently proposed numerical algorithm, deep BSDE method, has shown remarkable performance in solving high-dimensional forward-backward stochastic differential equations (FBSDEs) and parabolic partial differential equations (PDEs). This article lays a theoretical foundation for the deep BSDE method in the general case of coupled FBSDEs. In particular, a posteriori error estimation of the solution is provided and it is proved that the error converges to zero given the universal approximation capability of neural networks. Numerical results are presented to demonstrate the accuracy of the analyzed algorithm in solving high-dimensional coupled FBSDEs.  ( 2 min )
    LSTMSPLIT: Effective SPLIT Learning based LSTM on Sequential Time-Series Data. (arXiv:2203.04305v1 [cs.LG])
    Federated learning (FL) and split learning (SL) are the two popular distributed machine learning (ML) approaches that provide some data privacy protection mechanisms. In the time-series classification problem, many researchers typically use 1D convolutional neural networks (1DCNNs) based on the SL approach with a single client to reduce the computational overhead at the client-side while still preserving data privacy. Another method, recurrent neural network (RNN), is utilized on sequentially partitioned data where segments of multiple-segment sequential data are distributed across various clients. However, to the best of our knowledge, it is still not much work done in SL with long short-term memory (LSTM) network, even the LSTM network is practically effective in processing time-series data. In this work, we propose a new approach, LSTMSPLIT, that uses SL architecture with an LSTM network to classify time-series data with multiple clients. The differential privacy (DP) is applied to solve the data privacy leakage. The proposed method, LSTMSPLIT, has achieved better or reasonable accuracy compared to the Split-1DCNN method using the electrocardiogram dataset and the human activity recognition dataset. Furthermore, the proposed method, LSTMSPLIT, can also achieve good accuracy after applying differential privacy to preserve the user privacy of the cut layer of the LSTMSPLIT.  ( 2 min )
    Variational Inference with Locally Enhanced Bounds for Hierarchical Models. (arXiv:2203.04432v1 [cs.LG])
    Hierarchical models represent a challenging setting for inference algorithms. MCMC methods struggle to scale to large models with many local variables and observations, and variational inference (VI) may fail to provide accurate approximations due to the use of simple variational families. Some variational methods (e.g. importance weighted VI) integrate Monte Carlo methods to give better accuracy, but these tend to be unsuitable for hierarchical models, as they do not allow for subsampling and their performance tends to degrade for high dimensional models. We propose a new family of variational bounds for hierarchical models, based on the application of tightening methods (e.g. importance weighting) separately for each group of local random variables. We show that our approach naturally allows the use of subsampling to get unbiased gradients, and that it fully leverages the power of methods that build tighter lower bounds by applying them independently in lower dimensional spaces, leading to better results and more accurate posterior approximations than relevant baselines.
    Markov Random Geometric Graph (MRGG): A Growth Model for Temporal Dynamic Networks. (arXiv:2006.07001v3 [cs.LG] UPDATED)
    We introduce Markov Random Geometric Graphs (MRGGs), a growth model for temporal dynamic networks. It is based on a Markovian latent space dynamic: consecutive latent points are sampled on the Euclidean Sphere using an unknown Markov kernel; and two nodes are connected with a probability depending on a unknown function of their latent geodesic distance. More precisely, at each stamp-time $k$ we add a latent point $X_k$ sampled by jumping from the previous one $X_{k-1}$ in a direction chosen uniformly $Y_k$ and with a length $r_k$ drawn from an unknown distribution called the latitude function. The connection probabilities between each pair of nodes are equal to the envelope function of the distance between these two latent points. We provide theoretical guarantees for the non-parametric estimation of the latitude and the envelope functions.We propose an efficient algorithm that achieves those non-parametric estimation tasks based on an ad-hoc Hierarchical Agglomerative Clustering approach. As a by product, we show how MRGGs can be used to detect dependence structure in growing graphs and to solve link prediction problems.
    What Matters For Meta-Learning Vision Regression Tasks?. (arXiv:2203.04905v1 [cs.CV])
    Meta-learning is widely used in few-shot classification and function regression due to its ability to quickly adapt to unseen tasks. However, it has not yet been well explored on regression tasks with high dimensional inputs such as images. This paper makes two main contributions that help understand this barely explored area. \emph{First}, we design two new types of cross-category level vision regression tasks, namely object discovery and pose estimation of unprecedented complexity in the meta-learning domain for computer vision. To this end, we (i) exhaustively evaluate common meta-learning techniques on these tasks, and (ii) quantitatively analyze the effect of various deep learning techniques commonly used in recent meta-learning algorithms in order to strengthen the generalization capability: data augmentation, domain randomization, task augmentation and meta-regularization. Finally, we (iii) provide some insights and practical recommendations for training meta-learning algorithms on vision regression tasks. \emph{Second}, we propose the addition of functional contrastive learning (FCL) over the task representations in Conditional Neural Processes (CNPs) and train in an end-to-end fashion. The experimental results show that the results of prior work are misleading as a consequence of a poor choice of the loss function as well as too small meta-training sets. Specifically, we find that CNPs outperform MAML on most tasks without fine-tuning. Furthermore, we observe that naive task augmentation without a tailored design results in underfitting.
    The Severity Prediction of The Binary And Multi-Class Cardiovascular Disease -- A Machine Learning-Based Fusion Approach. (arXiv:2203.04921v1 [cs.LG])
    In today's world, a massive amount of data is available in almost every sector. This data has become an asset as we can use this enormous amount of data to find information. Mainly health care industry contains many data consisting of patient and disease-related information. By using the machine learning technique, we can look for hidden data patterns to predict various diseases. Recently CVDs, or cardiovascular disease, have become a leading cause of death around the world. The number of death due to CVDs is frightening. That is why many researchers are trying their best to design a predictive model that can save many lives using the data mining model. In this research, some fusion models have been constructed to diagnose CVDs along with its severity. Machine learning(ML) algorithms like artificial neural network, SVM, logistic regression, decision tree, random forest, and AdaBoost have been applied to the heart disease dataset to predict disease. Randomoversampler was implemented because of the class imbalance in multiclass classification. To improve the performance of classification, a weighted score fusion approach was taken. At first, the models were trained. After training, two algorithms' decision was combined using a weighted sum rule. A total of three fusion models have been developed from the six ML algorithms. The results were promising in the performance parameter. The proposed approach has been experimented with different test training ratios for binary and multiclass classification problems, and for both of them, the fusion models performed well. The highest accuracy for multiclass classification was found as 75%, and it was 95% for binary. The code can be found in : https://github.com/hafsa-kibria/Weighted_score_fusion_model_heart_disease_prediction
    Data-Efficient Structured Pruning via Submodular Optimization. (arXiv:2203.04940v1 [cs.LG])
    Structured pruning is an effective approach for compressing large pre-trained neural networks without significantly affecting their performance, which involves removing redundant regular regions of weights. However, current structured pruning methods are highly empirical in nature, do not provide any theoretical guarantees, and often require fine-tuning, which makes them inapplicable in the limited-data regime. We propose a principled data-efficient structured pruning method based on submodular optimization. In particular, for a given layer, we select neurons/channels to prune and corresponding new weights for the next layer, that minimize the change in the next layer's input induced by pruning. We show that this selection problem is a weakly submodular maximization problem, thus it can be provably approximated using an efficient greedy algorithm. Our method is one of the few in the literature that uses only a limited-number of training data and no labels. Our experimental results demonstrate that our method outperforms popular baseline methods in various one-shot pruning settings.
    Universal Online Learning: an Optimistically Universal Learning Rule. (arXiv:2201.05947v2 [cs.LG] UPDATED)
    We study the subject of universal online learning with non-i.i.d. processes for bounded losses. The notion of an universally consistent learning was defined by Hanneke in an effort to study learning theory under minimal assumptions, where the objective is to obtain low long-run average loss for any target function. We are interested in characterizing processes for which learning is possible and whether there exist learning rules guaranteed to be universally consistent given the only assumption that such learning is possible. The case of unbounded losses is very restrictive, since the learnable processes almost surely visit a finite number of points and as a result, simple memorization is optimistically universal. We focus on the bounded setting and give a complete characterization of the processes admitting strong and weak universal learning. We further show that k-nearest neighbor algorithm (kNN) is not optimistically universal and present a novel variant of 1NN which is optimistically universal for general input and value spaces in both strong and weak setting. This closes all COLT 2021 open problems posed by Hanneke on universal online learning.
    Structured Multi-task Learning for Molecular Property Prediction. (arXiv:2203.04695v1 [q-bio.BM])
    Multi-task learning for molecular property prediction is becoming increasingly important in drug discovery. However, in contrast to other domains, the performance of multi-task learning in drug discovery is still not satisfying as the number of labeled data for each task is too limited, which calls for additional data to complement the data scarcity. In this paper, we study multi-task learning for molecular property prediction in a novel setting, where a relation graph between tasks is available. We first construct a dataset including around 400 tasks as well as a task relation graph. Then to better utilize such relation graph, we propose a method called SGNN-EBM to systematically investigate the structured task modeling from two perspectives. (1) In the \emph{latent} space, we model the task representations by applying a state graph neural network (SGNN) on the relation graph. (2) In the \emph{output} space, we employ structured prediction with the energy-based model (EBM), which can be efficiently trained through noise-contrastive estimation (NCE) approach. Empirical results justify the effectiveness of SGNN-EBM. Code is available on https://github.com/chao1224/SGNN-EBM.  ( 2 min )
    Data Representativity for Machine Learning and AI Systems. (arXiv:2203.04706v1 [stat.ML])
    Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in the models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper analyzes data representativity in scientific literature related to AI and sampling, and gives a brief overview of statistical sampling methodology from disciplines like sampling of physical materials, experimental design, survey analysis, and observational studies. Different notions of a 'representative sample' exist in past and present literature. In particular, the contrast between the notion of a representative sample in the sense of coverage of the input space, versus a representative sample as a miniature of the target population is of relevance when building AI systems. Using empirical demonstrations on US Census data, we demonstrate that the first is useful for providing equality and demographic parity, and is more robust to distribution shifts, whereas the latter notion is useful in situations where the purpose is to make historical inference or draw inference about the underlying population in general, or make better predictions for the majority in the underlying population. We propose a framework of questions for creating and documenting data, with data representativity in mind, as an addition to existing datasheets for datasets. Finally, we will also like to call for caution of implicit, in addition to explicit, use of a notion of data representativeness without specific clarification.  ( 2 min )
    Scalable Cross Validation Losses for Gaussian Process Models. (arXiv:2105.11535v2 [stat.ML] UPDATED)
    We introduce a simple and scalable method for training Gaussian process (GP) models that exploits cross-validation and nearest neighbor truncation. To accommodate binary and multi-class classification we leverage P\`olya-Gamma auxiliary variables and variational inference. In an extensive empirical comparison with a number of alternative methods for scalable GP regression and classification, we find that our method offers fast training and excellent predictive performance. We argue that the good predictive performance can be traced to the non-parametric nature of the resulting predictive distributions as well as to the cross-validation loss, which provides robustness against model mis-specification.  ( 2 min )
    Inverse design optimization framework via a two-step deep learning approach: application to a wind turbine airfoil. (arXiv:2108.08500v3 [cs.LG] UPDATED)
    The inverse approach is computationally efficient in aerodynamic design as the desired target performance distribution is prespecified. However, it has some significant limitations that prevent it from achieving full efficiency. First, the iterative procedure should be repeated whenever the specified target distribution changes. Target distribution optimization can be performed to clarify the ambiguity in specifying this distribution, but several additional problems arise in this process such as loss of the representation capacity due to parameterization of the distribution, excessive constraints for a realistic distribution, inaccuracy of quantities of interest due to theoretical/empirical predictions, and the impossibility of explicitly imposing geometric constraints. To deal with these issues, a novel inverse design optimization framework with a two-step deep learning approach is proposed. A variational autoencoder and multi-layer perceptron are used to generate a realistic target distribution and predict the quantities of interest and shape parameters from the generated distribution, respectively. Then, target distribution optimization is performed as the inverse design optimization. The proposed framework applies active learning and transfer learning techniques to improve accuracy and efficiency. Finally, the framework is validated through aerodynamic shape optimizations of the wind turbine airfoil. Their results show that this framework is accurate, efficient, and flexible to be applied to other inverse design engineering applications.
    FEDI: Few-shot learning based on Earth Mover's Distance algorithm combined with deep residual network to identify diabetic retinopathy. (arXiv:2108.09711v2 [eess.IV] UPDATED)
    Diabetic retinopathy(DR) is the main cause of blindness in diabetic patients. However, DR can easily delay the occurrence of blindness through the diagnosis of the fundus. In view of the reality, it is difficult to collect a large amount of diabetic retina data in clinical practice. This paper proposes a few-shot learning model of a deep residual network based on Earth Mover's Distance algorithm to assist in diagnosing DR. We build training and validation classification tasks for few-shot learning based on 39 categories of 1000 sample data, train deep residual networks, and obtain experience maximization pre-training models. Based on the weights of the pre-trained model, the Earth Mover's Distance algorithm calculates the distance between the images, obtains the similarity between the images, and changes the model's parameters to improve the accuracy of the training model. Finally, the experimental construction of the small sample classification task of the test set to optimize the model further, and finally, an accuracy of 93.5667% on the 3way10shot task of the diabetic retina test set. For the experimental code and results, please refer to: https://github.com/panliangrui/few-shot-learning-funds.
    Reverse Engineering $\ell_p$ attacks: A block-sparse optimization approach with recovery guarantees. (arXiv:2203.04886v1 [cs.LG])
    Deep neural network-based classifiers have been shown to be vulnerable to imperceptible perturbations to their input, such as $\ell_p$-bounded norm adversarial attacks. This has motivated the development of many defense methods, which are then broken by new attacks, and so on. This paper focuses on a different but related problem of reverse engineering adversarial attacks. Specifically, given an attacked signal, we study conditions under which one can determine the type of attack ($\ell_1$, $\ell_2$ or $\ell_\infty$) and recover the clean signal. We pose this problem as a block-sparse recovery problem, where both the signal and the attack are assumed to lie in a union of subspaces that includes one subspace per class and one subspace per attack type. We derive geometric conditions on the subspaces under which any attacked signal can be decomposed as the sum of a clean signal plus an attack. In addition, by determining the subspaces that contain the signal and the attack, we can also classify the signal and determine the attack type. Experiments on digit and face classification demonstrate the effectiveness of the proposed approach.
    DISCO: Comprehensive and Explainable Disinformation Detection. (arXiv:2203.04928v1 [cs.LG])
    Disinformation refers to false information deliberately spread to influence the general public, and the negative impact of disinformation on society can be observed for numerous issues, such as political agendas and manipulating financial markets. In this paper, we identify prevalent challenges and advances related to automated disinformation detection from multiple aspects, and propose a comprehensive and explainable disinformation detection framework called DISCO. It leverages the heterogeneity of disinformation and addresses the prediction opaqueness. Then we provide a demonstration of DISCO on a real-world fake news detection task with satisfactory detection accuracy and explanation. The demo video and source code of DISCO is now publicly available. We expect that our demo could pave the way for addressing the limitations of identification, comprehension, and explainability as a whole.
    Data-driven detector signal characterization with constrained bottleneck autoencoders. (arXiv:2203.04604v1 [physics.ins-det])
    A common technique in high energy physics is to characterize the response of a detector by means of models tunned to data which build parametric maps from the physical parameters of the system to the expected signal of the detector. When the underlying model is unknown it is difficult to apply this method, and often, simplifying assumptions are made introducing modeling errors. In this article, using a waveform toy model we present how deep learning in the form of constrained bottleneck autoencoders can be used to learn the underlying unknown detector response model directly from data. The results show that excellent performance results can be achieved even when the signals are significantly affected by random noise. The trained algorithm can be used simultaneously to perform estimations on the physical parameters of the model, simulate the detector response with high fidelity and to denoise detector signals.
    Bilateral Deep Reinforcement Learning Approach for Better-than-human Car Following Model. (arXiv:2203.04749v1 [cs.RO])
    In the coming years and decades, autonomous vehicles (AVs) will become increasingly prevalent, offering new opportunities for safer and more convenient travel and potentially smarter traffic control methods exploiting automation and connectivity. Car following is a prime function in autonomous driving. Car following based on reinforcement learning has received attention in recent years with the goal of learning and achieving performance levels comparable to humans. However, most existing RL methods model car following as a unilateral problem, sensing only the vehicle ahead. Recent literature, however, Wang and Horn [16] has shown that bilateral car following that considers the vehicle ahead and the vehicle behind exhibits better system stability. In this paper we hypothesize that this bilateral car following can be learned using RL, while learning other goals such as efficiency maximisation, jerk minimization, and safety rewards leading to a learned model that outperforms human driving. We propose and introduce a Deep Reinforcement Learning (DRL) framework for car following control by integrating bilateral information into both state and reward function based on the bilateral control model (BCM) for car following control. Furthermore, we use a decentralized multi-agent reinforcement learning framework to generate the corresponding control action for each agent. Our simulation results demonstrate that our learned policy is better than the human driving policy in terms of (a) inter-vehicle headways, (b) average speed, (c) jerk, (d) Time to Collision (TTC) and (e) string stability.  ( 2 min )
    P2M: A Processing-in-Pixel-in-Memory Paradigm for Resource-Constrained TinyML Applications. (arXiv:2203.04737v1 [cs.LG])
    The demand to process vast amounts of data generated from state-of-the-art high resolution cameras has motivated novel energy-efficient on-device AI solutions. Visual data in such cameras are usually captured in the form of analog voltages by a sensor pixel array, and then converted to the digital domain for subsequent AI processing using analog-to-digital converters (ADC). Recent research has tried to take advantage of massively parallel low-power analog/digital computing in the form of near- and in-sensor processing, in which the AI computation is performed partly in the periphery of the pixel array and partly in a separate on-board CPU/accelerator. Unfortunately, high-resolution input images still need to be streamed between the camera and the AI processing unit, frame by frame, causing energy, bandwidth, and security bottlenecks. To mitigate this problem, we propose a novel Processing-in-Pixel-in-memory (P2M) paradigm, that customizes the pixel array by adding support for analog multi-channel, multi-bit convolution and ReLU (Rectified Linear Units). Our solution includes a holistic algorithm-circuit co-design approach and the resulting P2M paradigm can be used as a drop-in replacement for embedding memory-intensive first few layers of convolutional neural network (CNN) models within foundry-manufacturable CMOS image sensor platforms. Our experimental results indicate that P2M reduces data transfer bandwidth from sensors and analog to digital conversions by ~21x, and the energy-delay product (EDP) incurred in processing a MobileNetV2 model on a TinyML use case for visual wake words dataset (VWW) by up to ~11x compared to standard near-processing or in-sensor implementations, without any significant drop in test accuracy.  ( 2 min )
    Anomaly Detection for Unmanned Aerial Vehicle Sensor Data Using a Stacked Recurrent Autoencoder Method with Dynamic Thresholding. (arXiv:2203.04734v1 [cs.LG])
    With substantial recent developments in aviation technologies, Unmanned Aerial Vehicles (UAVs) are becoming increasingly integrated in commercial and military operations internationally. Research into the applications of aircraft data is essential in improving safety, reducing operational costs, and developing the next frontier of aerial technology. Having an outlier detection system that can accurately identify anomalous behaviour in aircraft is crucial for these reasons. This paper proposes a system incorporating a Long Short-Term Memory (LSTM) Deep Learning Autoencoder based method with a novel dynamic thresholding algorithm and weighted loss function for anomaly detection of a UAV dataset, in order to contribute to the ongoing efforts that leverage innovations in machine learning and data analysis within the aviation industry. The dynamic thresholding and weighted loss functions showed promising improvements to the standard static thresholding method, both in accuracy-related performance metrics and in speed of true fault detection.  ( 2 min )
    New Coresets for Projective Clustering and Applications. (arXiv:2203.04370v1 [cs.LG])
    $(j,k)$-projective clustering is the natural generalization of the family of $k$-clustering and $j$-subspace clustering problems. Given a set of points $P$ in $\mathbb{R}^d$, the goal is to find $k$ flats of dimension $j$, i.e., affine subspaces, that best fit $P$ under a given distance measure. In this paper, we propose the first algorithm that returns an $L_\infty$ coreset of size polynomial in $d$. Moreover, we give the first strong coreset construction for general $M$-estimator regression. Specifically, we show that our construction provides efficient coreset constructions for Cauchy, Welsch, Huber, Geman-McClure, Tukey, $L_1-L_2$, and Fair regression, as well as general concave and power-bounded loss functions. Finally, we provide experimental results based on real-world datasets, showing the efficacy of our approach.  ( 2 min )
    Pretrained Domain-Specific Language Model for General Information Retrieval Tasks in the AEC Domain. (arXiv:2203.04729v1 [cs.CL])
    As an essential task for the architecture, engineering, and construction (AEC) industry, information retrieval (IR) from unstructured textual data based on natural language processing (NLP) is gaining increasing attention. Although various deep learning (DL) models for IR tasks have been investigated in the AEC domain, it is still unclear how domain corpora and domain-specific pretrained DL models can improve performance in various IR tasks. To this end, this work systematically explores the impacts of domain corpora and various transfer learning techniques on the performance of DL models for IR tasks and proposes a pretrained domain-specific language model for the AEC domain. First, both in-domain and close-domain corpora are developed. Then, two types of pretrained models, including traditional wording embedding models and BERT-based models, are pretrained based on various domain corpora and transfer learning strategies. Finally, several widely used DL models for IR tasks are further trained and tested based on various configurations and pretrained models. The result shows that domain corpora have opposite effects on traditional word embedding models for text classification and named entity recognition tasks but can further improve the performance of BERT-based models in all tasks. Meanwhile, BERT-based models dramatically outperform traditional methods in all IR tasks, with maximum improvements of 5.4% and 10.1% in the F1 score, respectively. This research contributes to the body of knowledge in two ways: 1) demonstrating the advantages of domain corpora and pretrained DL models and 2) opening the first domain-specific dataset and pretrained language model for the AEC domain, to the best of our knowledge. Thus, this work sheds light on the adoption and application of pretrained models in the AEC domain.  ( 2 min )
    Quantum neural networks force fields generation. (arXiv:2203.04666v1 [quant-ph])
    Accurate molecular force fields are of paramount importance for the efficient implementation of molecular dynamics techniques at large scales. In the last decade, machine learning methods have demonstrated impressive performances in predicting accurate values for energy and forces when trained on finite size ensembles generated with ab initio techniques. At the same time, quantum computers have recently started to offer new viable computational paradigms to tackle such problems. On the one hand, quantum algorithms may notably be used to extend the reach of electronic structure calculations. On the other hand, quantum machine learning is also emerging as an alternative and promising path to quantum advantage. Here we follow this second route and establish a direct connection between classical and quantum solutions for learning neural network potentials. To this end, we design a quantum neural network architecture and apply it successfully to different molecules of growing complexity. The quantum models exhibit larger effective dimension with respect to classical counterparts and can reach competitive performances, thus pointing towards potential quantum advantages in natural science applications via quantum machine learning.  ( 2 min )
    Representation, learning, and planning algorithms for geometric task and motion planning. (arXiv:2203.04605v1 [cs.RO])
    We present a framework for learning to guide geometric task and motion planning (GTAMP). GTAMP is a subclass of task and motion planning in which the goal is to move multiple objects to target regions among movable obstacles. A standard graph search algorithm is not directly applicable, because GTAMP problems involve hybrid search spaces and expensive action feasibility checks. To handle this, we introduce a novel planner that extends basic heuristic search with random sampling and a heuristic function that prioritizes feasibility checking on promising state action pairs. The main drawback of such pure planners is that they lack the ability to learn from planning experience to improve their efficiency. We propose two learning algorithms to address this. The first is an algorithm for learning a rank function that guides the discrete task level search, and the second is an algorithm for learning a sampler that guides the continuous motionlevel search. We propose design principles for designing data efficient algorithms for learning from planning experience and representations for effective generalization. We evaluate our framework in challenging GTAMP problems, and show that we can improve both planning and data efficiency  ( 2 min )
    Non-equilibrium molecular geometries in graph neural networks. (arXiv:2203.04697v1 [q-bio.BM])
    Graph neural networks have become a powerful framework for learning complex structure-property relationships and fast screening of chemical compounds. Recently proposed methods have demonstrated that using 3D geometry information of the molecule along with the bonding structure can lead to more accurate prediction on a wide range of properties. A common practice is to use 3D geometries computed through density functional theory (DFT) for both training and testing of models. However, the computational time needed for DFT calculations can be prohibitively large. Moreover, many of the properties that we aim to predict can often be obtained with little or no overhead on top of the DFT calculations used to produce the 3D geometry information, voiding the need for a predictive model. To be practically useful for high-throughput chemical screening and drug discovery, it is desirable to work with 3D geometries obtained using less-accurate but much more efficient non-DFT methods. In this work we investigate the impact of using non-DFT conformations in the training and the testing of existing models and propose a data augmentation method for improving the prediction accuracy of classical forcefield-derived geometries.  ( 2 min )
    On a linear fused Gromov-Wasserstein distance for graph structured data. (arXiv:2203.04711v1 [cs.LG])
    We present a framework for embedding graph structured data into a vector space, taking into account node features and topology of a graph into the optimal transport (OT) problem. Then we propose a novel distance between two graphs, named linearFGW, defined as the Euclidean distance between their embeddings. The advantages of the proposed distance are twofold: 1) it can take into account node feature and structure of graphs for measuring the similarity between graphs in a kernel-based framework, 2) it can be much faster for computing kernel matrix than pairwise OT-based distances, particularly fused Gromov-Wasserstein, making it possible to deal with large-scale data sets. After discussing theoretical properties of linearFGW, we demonstrate experimental results on classification and clustering tasks, showing the effectiveness of the proposed linearFGW.  ( 2 min )
    UENAS: A Unified Evolution-based NAS Framework. (arXiv:2203.04300v1 [cs.LG])
    Neural architecture search (NAS) has gained significant attention for automatic network design in recent years. Previous NAS methods suffer from limited search spaces, which may lead to sub-optimal results. In this paper, we propose UENAS, an evolution-based NAS framework with a broader search space that supports optimizing network architectures, pruning strategies, and hyperparameters simultaneously. To alleviate the huge search cost caused by the expanded search space, three strategies are adopted: First, an adaptive pruning strategy that iteratively trims the average model size in the population without compromising performance. Second, child networks share weights of overlapping layers with pre-trained parent networks to reduce the training epochs. Third, an online predictor scores the joint representations of architecture, pruning strategy, and hyperparameters to filter out inferior combos. By the proposed three strategies, the search efficiency is significantly improved and more well-performed compact networks with tailored hyper-parameters are derived. In experiments, UENAS achieves error rates of 2.81% on CIFAR-10, 20.24% on CIFAR-100, and 33% on Tiny-ImageNet, which shows the effectiveness of our method.  ( 2 min )
    Cluster Head Detection for Hierarchical UAV Swarm With Graph Self-supervised Learning. (arXiv:2203.04311v1 [cs.LG])
    In this paper, we study the cluster head detection problem of a two-level unmanned aerial vehicle (UAV) swarm network (USNET) with multiple UAV clusters, where the inherent follow strategy (IFS) of low-level follower UAVs (FUAVs) with respect to high-level cluster head UAVs (HUAVs) is unknown. We first propose a graph attention self-supervised learning algorithm (GASSL) to detect the HUAVs of a single UAV cluster, where the GASSL can fit the IFS at the same time. Then, to detect the HUAVs in the USNET with multiple UAV clusters, we develop a multi-cluster graph attention self-supervised learning algorithm (MC-GASSL) based on the GASSL. The MC-GASSL clusters the USNET with a gated recurrent unit (GRU)-based metric learning scheme and finds the HUAVs in each cluster with GASSL. Numerical results show that the GASSL can detect the HUAVs in single UAV clusters obeying various kinds of IFSs with over 98% average accuracy. The simulation results also show that the clustering purity of the USNET with MC-GASSL exceeds that with traditional clustering algorithms by at least 10% average. Furthermore, the MC-GASSL can efficiently detect all the HUAVs in USNETs with various IFSs and cluster numbers with low detection redundancies.  ( 2 min )
    Towards performant and reliable undersampled MR reconstruction via diffusion model sampling. (arXiv:2203.04292v1 [eess.IV])
    Magnetic Resonance (MR) image reconstruction from under-sampled acquisition promises faster scanning time. To this end, current State-of-The-Art (SoTA) approaches leverage deep neural networks and supervised training to learn a recovery model. While these approaches achieve impressive performances, the learned model can be fragile on unseen degradation, e.g. when given a different acceleration factor. These methods are also generally deterministic and provide a single solution to an ill-posed problem; as such, it can be difficult for practitioners to understand the reliability of the reconstruction. We introduce DiffuseRecon, a novel diffusion model-based MR reconstruction method. DiffuseRecon guides the generation process based on the observed signals and a pre-trained diffusion model, and does not require additional training on specific acceleration factors. DiffuseRecon is stochastic in nature and generates results from a distribution of fully-sampled MR images; as such, it allows us to explicitly visualize different potential reconstruction solutions. Lastly, DiffuseRecon proposes an accelerated, coarse-to-fine Monte-Carlo sampling scheme to approximate the most likely reconstruction candidate. The proposed DiffuseRecon achieves SoTA performances reconstructing from raw acquisition signals in fastMRI and SKM-TEA.  ( 2 min )
    Learning from Few Examples: A Summary of Approaches to Few-Shot Learning. (arXiv:2203.04291v1 [cs.LG])
    Few-Shot Learning refers to the problem of learning the underlying pattern in the data just from a few training samples. Requiring a large number of data samples, many deep learning solutions suffer from data hunger and extensively high computation time and resources. Furthermore, data is often not available due to not only the nature of the problem or privacy concerns but also the cost of data preparation. Data collection, preprocessing, and labeling are strenuous human tasks. Therefore, few-shot learning that could drastically reduce the turnaround time of building machine learning applications emerges as a low-cost solution. This survey paper comprises a representative list of recently proposed few-shot learning algorithms. Given the learning dynamics and characteristics, the approaches to few-shot learning problems are discussed in the perspectives of meta-learning, transfer learning, and hybrid approaches (i.e., different variations of the few-shot learning problem).  ( 2 min )
    Unsupervised Domain Adaptation across FMCW Radar Configurations Using Margin Disparity Discrepancy. (arXiv:2203.04588v1 [eess.SP])
    Commercial radar sensing is gaining relevance and machine learning algorithms constitute one of the key components that are enabling the spread of this radio technology into areas like surveillance or healthcare. However, radar datasets are still scarce and generalization cannot be yet achieved for all radar systems, environment conditions or design parameters. A certain degree of fine tuning is, therefore, usually required to deploy machine-learning-enabled radar applications. In this work, we consider the problem of unsupervised domain adaptation across radar configurations in the context of deep-learning human activity classification using frequency-modulated continuous-wave. For that, we focus on the theory-inspired technique of Margin Disparity Discrepancy, which has already been proved successful in the area of computer vision. Our experiments extend this technique to radar data, achieving a comparable accuracy to fewshot supervised approaches for the same classification problem.  ( 2 min )
    Revealing the Excitation Causality between Climate and Political Violence via a Neural Forward-Intensity Poisson Process. (arXiv:2203.04511v1 [cs.LG])
    The causal mechanism between climate and political violence is fraught with complex mechanisms. Current quantitative causal models rely on one or more assumptions: (1) the climate drivers persistently generate conflict, (2) the causal mechanisms have a linear relationship with the conflict generation parameter, and/or (3) there is sufficient data to inform the prior distribution. Yet, we know conflict drivers often excite a social transformation process which leads to violence (e.g., drought forces agricultural producers to join urban militia), but further climate effects do not necessarily contribute to further violence. Therefore, not only is this bifurcation relationship highly non-linear, there is also often a lack of data to support prior assumptions for high resolution modeling. Here, we aim to overcome the aforementioned causal modeling challenges by proposing a neural forward-intensity Poisson process (NFIPP) model. The NFIPP is designed to capture the potential non-linear causal mechanism in climate induced political violence, whilst being robust to sparse and timing-uncertain data. Our results span 20 recent years and reveal an excitation-based causal link between extreme climate events and political violence across diverse countries. Our climate-induced conflict model results are cross-validated against qualitative climate vulnerability indices. Furthermore, we label historical events that either improve or reduce our predictability gain, demonstrating the importance of domain expertise in informing interpretation.  ( 2 min )
    Multi-Agent Active Search using Detection and Location Uncertainty. (arXiv:2203.04524v1 [cs.RO])
    Active search refers to the task of autonomous robots (agents) detecting objects of interest (targets) in a search space using decision making algorithms that adapt to the history of their observations. It has important applications in search and rescue missions, wildlife patrolling and environment monitoring. Active search algorithms must contend with two types of uncertainty: detection uncertainty and location uncertainty. Prior work has typically focused on one of these while ignoring or engineering away the other. The more common approach in robotics is to focus on location uncertainty and remove detection uncertainty by thresholding the detection probability to zero or one. On the other hand, it is common in the sparse signal processing literature to assume the target location is accurate and focus on the uncertainty of its detection. In this work, we propose an inference method to jointly handle both target detection and location uncertainty. We then build a decision making algorithm on this inference method that uses Thompson sampling to enable efficient active search in both the single agent and multi-agent settings. We perform experiments in simulation over varying number of agents and targets to show that our inference and decision making algorithms outperform competing baselines that only account for either target detection or location uncertainty.  ( 2 min )
    Metric Entropy Duality and the Sample Complexity of Outcome Indistinguishability. (arXiv:2203.04536v1 [cs.LG])
    We give the first sample complexity characterizations for outcome indistinguishability, a theoretical framework of machine learning recently introduced by Dwork, Kim, Reingold, Rothblum, and Yona (STOC 2021). In outcome indistinguishability, the goal of the learner is to output a predictor that cannot be distinguished from the target predictor by a class $D$ of distinguishers examining the outcomes generated according to the predictors' predictions. In the distribution-specific and realizable setting where the learner is given the data distribution together with a predictor class $P$ containing the target predictor, we show that the sample complexity of outcome indistinguishability is characterized by the metric entropy of $P$ w.r.t. the dual Minkowski norm defined by $D$, and equivalently by the metric entropy of $D$ w.r.t. the dual Minkowski norm defined by $P$. This equivalence makes an intriguing connection to the long-standing metric entropy duality conjecture in convex geometry. Our sample complexity characterization implies a variant of metric entropy duality, which we show is nearly tight. In the distribution-free setting, we focus on the case considered by Dwork et al. where $P$ contains all possible predictors, hence the sample complexity only depends on $D$. In this setting, we show that the sample complexity of outcome indistinguishability is characterized by the fat-shattering dimension of $D$. We also show a strong sample complexity separation between realizable and agnostic outcome indistinguishability in both the distribution-free and the distribution-specific settings. This is in contrast to distribution-free (resp. distribution-specific) PAC learning where the sample complexity in both the realizable and the agnostic settings can be characterized by the VC dimension (resp. metric entropy).  ( 2 min )
    Breast cancer detection using artificial intelligence techniques: A systematic literature review. (arXiv:2203.04308v1 [eess.IV])
    Cancer is one of the most dangerous diseases to humans, and yet no permanent cure has been developed for it. Breast cancer is one of the most common cancer types. According to the National Breast Cancer foundation, in 2020 alone, more than 276,000 new cases of invasive breast cancer and more than 48,000 non-invasive cases were diagnosed in the US. To put these figures in perspective, 64% of these cases are diagnosed early in the disease's cycle, giving patients a 99% chance of survival. Artificial intelligence and machine learning have been used effectively in detection and treatment of several dangerous diseases, helping in early diagnosis and treatment, and thus increasing the patient's chance of survival. Deep learning has been designed to analyze the most important features affecting detection and treatment of serious diseases. For example, breast cancer can be detected using genes or histopathological imaging. Analysis at the genetic level is very expensive, so histopathological imaging is the most common approach used to detect breast cancer. In this research work, we systematically reviewed previous work done on detection and treatment of breast cancer using genetic sequencing or histopathological imaging with the help of deep learning and machine learning. We also provide recommendations to researchers who will work in this field  ( 2 min )
    MLNav: Learning to Safely Navigate on Martian Terrains. (arXiv:2203.04563v1 [cs.RO])
    We present MLNav, a learning-enhanced path planning framework for safety-critical and resource-limited systems operating in complex environments, such as rovers navigating on Mars. MLNav makes judicious use of machine learning to enhance the efficiency of path planning while fully respecting safety constraints. In particular, the dominant computational cost in such safety-critical settings is running a model-based safety checker on the proposed paths. Our learned search heuristic can simultaneously predict the feasibility for all path options in a single run, and the model-based safety checker is only invoked on the top-scoring paths. We validate in high-fidelity simulations using both real Martian terrain data collected by the Perseverance rover, as well as a suite of challenging synthetic terrains. Our experiments show that: (i) compared to the baseline ENav path planner on board the Perserverance rover, MLNav can provide a significant improvement in multiple key metrics, such as a 10x reduction in collision checks when navigating real Martian terrains, despite being trained with synthetic terrains; and (ii) MLNav can successfully navigate highly challenging terrains where the baseline ENav fails to find a feasible path before timing out.  ( 2 min )
    Reinforced Meta Active Learning. (arXiv:2203.04573v1 [cs.LG])
    In stream-based active learning, the learning procedure typically has access to a stream of unlabeled data instances and must decide for each instance whether to label it and use it for training or to discard it. There are numerous active learning strategies which try to minimize the number of labeled samples required for training in this setting by identifying and retaining the most informative data samples. Most of these schemes are rule-based and rely on the notion of uncertainty, which captures how small the distance of a data sample is from the classifier's decision boundary. Recently, there have been some attempts to learn optimal selection strategies directly from the data, but many of them are still lacking generality for several reasons: 1) They focus on specific classification setups, 2) They rely on rule-based metrics, 3) They require offline pre-training of the active learner on related tasks. In this work we address the above limitations and present an online stream-based meta active learning method which learns on the fly an informativeness measure directly from the data, and is applicable to a general class of classification problems without any need for pretraining of the active learner on related tasks. The method is based on reinforcement learning and combines episodic policy search and a contextual bandits approach which are used to train the active learner in conjunction with training of the model. We demonstrate on several real datasets that this method learns to select training samples more efficiently than existing state-of-the-art methods.  ( 2 min )
    The Combinatorial Brain Surgeon: Pruning Weights That Cancel One Another in Neural Networks. (arXiv:2203.04466v1 [cs.LG])
    Neural networks tend to achieve better accuracy with training if they are larger -- even if the resulting models are overparameterized. Nevertheless, carefully removing such excess parameters before, during, or after training may also produce models with similar or even improved accuracy. In many cases, that can be curiously achieved by heuristics as simple as removing a percentage of the weights with the smallest absolute value -- even though magnitude is not a perfect proxy for weight relevance. With the premise that obtaining significantly better performance from pruning depends on accounting for the combined effect of removing multiple weights, we revisit one of the classic approaches for impact-based pruning: the Optimal Brain Surgeon~(OBS). We propose a tractable heuristic for solving the combinatorial extension of OBS, in which we select weights for simultaneous removal, as well as a systematic update of the remaining weights. Our selection method outperforms other methods under high sparsity, and the weight update is advantageous even when combined with the other methods.  ( 2 min )
    MetaCon: Unified Predictive Segments System with Trillion Concept Meta-Learning. (arXiv:2203.04540v1 [cs.AI])
    Accurate understanding of users in terms of predicative segments play an essential role in the day to day operation of modern internet enterprises. Nevertheless, there are significant challenges that limit the quality of data, especially on long tail predictive tasks. In this work, we present MetaCon, our unified predicative segments system with scalable, trillion concepts meta learning that addresses these challenges. It builds on top of a flat concept representation that summarizes entities' heterogeneous digital footprint, jointly considers the entire spectrum of predicative tasks as a single learning task, and leverages principled meta learning approach with efficient first order meta-optimization procedure under a provable performance guarantee in order to solve the learning task. Experiments on both proprietary production datasets and public structured learning tasks demonstrate that MetaCon can lead to substantial improvements over state of the art recommendation and ranking approaches.  ( 2 min )
    Leveraging Smooth Attention Prior for Multi-Agent Trajectory Prediction. (arXiv:2203.04421v1 [cs.LG])
    Multi-agent interactions are important to model for forecasting other agents' behaviors and trajectories. At a certain time, to forecast a reasonable future trajectory, each agent needs to pay attention to the interactions with only a small group of most relevant agents instead of unnecessarily paying attention to all the other agents. However, existing attention modeling works ignore that human attention in driving does not change rapidly, and may introduce fluctuating attention across time steps. In this paper, we formulate an attention model for multi-agent interactions based on a total variation temporal smoothness prior and propose a trajectory prediction architecture that leverages the knowledge of these attended interactions. We demonstrate how the total variation attention prior along with the new sequence prediction loss terms leads to smoother attention and more sample-efficient learning of multi-agent trajectory prediction, and show its advantages in terms of prediction accuracy by comparing it with the state-of-the-art approaches on both synthetic and naturalistic driving data. We demonstrate the performance of our algorithm for trajectory prediction on the INTERACTION dataset on our website.  ( 2 min )
    Score matching enables causal discovery of nonlinear additive noise models. (arXiv:2203.04413v1 [cs.LG])
    This paper demonstrates how to recover causal graphs from the score of the data distribution in non-linear additive (Gaussian) noise models. Using score matching algorithms as a building block, we show how to design a new generation of scalable causal discovery methods. To showcase our approach, we also propose a new efficient method for approximating the score's Jacobian, enabling to recover the causal graph. Empirically, we find that the new algorithm, called SCORE, is competitive with state-of-the-art causal discovery methods while being significantly faster.  ( 2 min )
    Art-Attack: Black-Box Adversarial Attack via Evolutionary Art. (arXiv:2203.04405v1 [cs.CR])
    Deep neural networks (DNNs) have achieved state-of-the-art performance in many tasks but have shown extreme vulnerabilities to attacks generated by adversarial examples. Many works go with a white-box attack that assumes total access to the targeted model including its architecture and gradients. A more realistic assumption is the black-box scenario where an attacker only has access to the targeted model by querying some input and observing its predicted class probabilities. Different from most prevalent black-box attacks that make use of substitute models or gradient estimation, this paper proposes a gradient-free attack by using a concept of evolutionary art to generate adversarial examples that iteratively evolves a set of overlapping transparent shapes. To evaluate the effectiveness of our proposed method, we attack three state-of-the-art image classification models trained on the CIFAR-10 dataset in a targeted manner. We conduct a parameter study outlining the impact the number and type of shapes have on the proposed attack's performance. In comparison to state-of-the-art black-box attacks, our attack is more effective at generating adversarial examples and achieves a higher attack success rate on all three baseline models.  ( 2 min )
    ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches. (arXiv:2203.04412v1 [cs.CR])
    Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding, and requires careful hyperparameter tuning, potentially leading to suboptimal robustness evaluations. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches, optimized to generalize across different models, and readily applicable to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations. We showcase the usefulness of this dataset by testing the effectiveness of the computed patches against 127 models. We conclude by discussing how our dataset could be used as a benchmark for robustness, and how our methodology can be generalized to other domains. We open source our dataset and evaluation code at https://github.com/pralab/ImageNet-Patch.  ( 2 min )
    CIDER: Exploiting Hyperspherical Embeddings for Out-of-Distribution Detection. (arXiv:2203.04450v1 [cs.CV])
    Out-of-distribution (OOD) detection is a critical task for reliable machine learning. Recent advances in representation learning give rise to developments in distance-based OOD detection, where testing samples are detected as OOD if they are relatively far away from the centroids or prototypes of in-distribution (ID) classes. However, prior methods directly take off-the-shelf loss functions that suffice for classifying ID samples, but are not optimally designed for OOD detection. In this paper, we propose CIDER, a simple and effective representation learning framework by exploiting hyperspherical embeddings for OOD detection. CIDER jointly optimizes two losses to promote strong ID-OOD separability: (1) a dispersion loss that promotes large angular distances among different class prototypes, and (2) a compactness loss that encourages samples to be close to their class prototypes. We show that CIDER is effective under various settings and establishes state-of-the-art performance. On a hard OOD detection task CIFAR-100 vs. CIFAR-10, our method substantially improves the AUROC by 14.20% compared to the embeddings learned by the cross-entropy loss.  ( 2 min )
    Pruning Graph Convolutional Networks to select meaningful graph frequencies for fMRI decoding. (arXiv:2203.04455v1 [cs.LG])
    Graph Signal Processing is a promising framework to manipulate brain signals as it allows to encompass the spatial dependencies between the activity in regions of interest in the brain. In this work, we are interested in better understanding what are the graph frequencies that are the most useful to decode fMRI signals. To this end, we introduce a deep learning architecture and adapt a pruning methodology to automatically identify such frequencies. We experiment with various datasets, architectures and graphs, and show that low graph frequencies are consistently identified as the most important for fMRI decoding, with a stronger contribution for the functional graph over the structural one. We believe that this work provides novel insights on how graph-based methods can be deployed to increase fMRI decoding accuracy and interpretability.  ( 2 min )
    The Flag Median and FlagIRLS. (arXiv:2203.04437v1 [stat.ML])
    Finding prototypes (e.g., mean and median) for a dataset is central to a number of common machine learning algorithms. Subspaces have been shown to provide useful, robust representations for datasets of images, videos and more. Since subspaces correspond to points on a Grassmann manifold, one is led to consider the idea of a subspace prototype for a Grassmann-valued dataset. While a number of different subspace prototypes have been described, the calculation of some of these prototypes has proven to be computationally expensive while other prototypes are affected by outliers and produce highly imperfect clustering on noisy data. This work proposes a new subspace prototype, the flag median, and introduces the FlagIRLS algorithm for its calculation. We provide evidence that the flag median is robust to outliers and can be used effectively in algorithms like Linde-Buzo-Grey (LBG) to produce improved clusterings on Grassmannians. Numerical experiments include a synthetic dataset, the MNIST handwritten digits dataset, the Mind's Eye video dataset and the UCF YouTube action dataset. The flag median is compared the other leading algorithms for computing prototypes on the Grassmannian, namely, the $\ell_2$-median and to the flag mean. We find that using FlagIRLS to compute the flag median converges in $4$ iterations on a synthetic dataset. We also see that Grassmannian LBG with a codebook size of $20$ and using the flag median produces at least a $10\%$ improvement in cluster purity over Grassmannian LBG using the flag mean or $\ell_2$-median on the Mind's Eye dataset.  ( 2 min )
    KPF-AE-LSTM: A Deep Probabilistic Model for Net-Load Forecasting in High Solar Scenarios. (arXiv:2203.04401v1 [eess.SP])
    With the expected rise in behind-the-meter solar penetration within the distribution networks, there is a need to develop time-series forecasting methods that can reliably predict the net-load, accurately quantifying its uncertainty and variability. This paper presents a deep learning method to generate probabilistic forecasts of day-ahead net-load at 15-min resolution, at various solar penetration levels. Our proposed deep-learning based architecture utilizes the dimensional reduction, from a higher-dimensional input to a lower-dimensional latent space, via a convolutional Autoencoder (AE). The extracted features from AE are then utilized to generate probability distributions across the latent space, by passing the features through a kernel-embedded Perron-Frobenius (kPF) operator. Finally, long short-term memory (LSTM) layers are used to synthesize time-series probability distributions of the forecasted net-load, from the latent space distributions. The models are shown to deliver superior forecast performance (as per several metrics), as well as maintain superior training efficiency, in comparison to existing benchmark models. Detailed analysis is carried out to evaluate the model performance across various solar penetration levels (up to 50\%), prediction horizons (e.g., 15\,min and 24\,hr ahead), and aggregation level of houses, as well as its robustness against missing measurements.  ( 2 min )
    Structural Learning of Simple Staged Trees. (arXiv:2203.04390v1 [stat.ML])
    Bayesian networks faithfully represent the symmetric conditional independences existing between the components of a random vector. Staged trees are an extension of Bayesian networks for categorical random vectors whose graph represents non-symmetric conditional independences via vertex coloring. However, since they are based on a tree representation of the sample space, the underlying graph becomes cluttered and difficult to visualize as the number of variables increases. Here we introduce the first structural learning algorithms for the class of simple staged trees, entertaining a compact coalescence of the underlying tree from which non-symmetric independences can be easily read. We show that data-learned simple staged trees often outperform Bayesian networks in model fit and illustrate how the coalesced graph is used to identify non-symmetric conditional independences.  ( 2 min )
    On generative models as the basis for digital twins. (arXiv:2203.04384v1 [cs.LG])
    A framework is proposed for generative models as a basis for digital twins or mirrors of structures. The proposal is based on the premise that deterministic models cannot account for the uncertainty present in most structural modelling applications. Two different types of generative models are considered here. The first is a physics-based model based on the stochastic finite element (SFE) method, which is widely used when modelling structures that have material and loading uncertainties imposed. Such models can be calibrated according to data from the structure and would be expected to outperform any other model if the modelling accurately captures the true underlying physics of the structure. The potential use of SFE models as digital mirrors is illustrated via application to a linear structure with stochastic material properties. For situations where the physical formulation of such models does not suffice, a data-driven framework is proposed, using machine learning and conditional generative adversarial networks (cGANs). The latter algorithm is used to learn the distribution of the quantity of interest in a structure with material nonlinearities and uncertainties. For the examples considered in this work, the data-driven cGANs model outperform the physics-based approach. Finally, an example is shown where the two methods are coupled such that a hybrid model approach is demonstrated.  ( 2 min )
    A Novel Deep Learning Model for Hotel Demand and Revenue Prediction amid COVID-19. (arXiv:2203.04383v1 [cs.LG])
    The COVID-19 pandemic has significantly impacted the tourism and hospitality sector. Public policies such as travel restrictions and stay-at-home orders had significantly affected tourist activities and service businesses' operations and profitability. To this end, it is essential to develop an interpretable forecast model that supports managerial and organizational decision-making. We developed DemandNet, a novel deep learning framework for predicting time series data under the influence of the COVID-19 pandemic. The framework starts by selecting the top static and dynamic features embedded in the time series data. Then, it includes a nonlinear model which can provide interpretable insight into the previously seen data. Lastly, a prediction model is developed to leverage the above characteristics to make robust long-term forecasts. We evaluated the framework using daily hotel demand and revenue data from eight cities in the US. Our findings reveal that DemandNet outperforms the state-of-art models and can accurately predict the impact of the COVID-19 pandemic on hotel demand and revenues.  ( 2 min )
    PyNET-QxQ: A Distilled PyNET for QxQ Bayer Pattern Demosaicing in CMOS Image Sensor. (arXiv:2203.04314v1 [eess.IV])
    The deep learning-based ISP models for mobile cameras produce high-quality images comparable to the professional DSLR camera. However, many of them are computationally expensive, which may not be appropriate for mobile environments. Also, the recent mobile cameras adopt non-Bayer CFAs (e.g., Quad Bayer, Nona Bayer, and QxQ Bayer) to improve image quality; however, most deep learning-based ISP models mainly focus on standard Bayer CFA. In this work, we propose PyNET-QxQ based on PyNET, a light-weighted ISP explicitly designed for the QxQ CFA pattern. The number of parameters of PyNET-QxQ is less than 2.5% of PyNET. We also introduce a novel knowledge distillation technique, progressive distillation, to train the compressed network effectively. Finally, experiments with QxQ images (obtained by an actual QxQ camera sensor, under development) demonstrate the outstanding performance of PyNET-QxQ despite significant parameter reductions.  ( 2 min )
    TTML: tensor trains for general supervised machine learning. (arXiv:2203.04352v1 [cs.LG])
    This work proposes a novel general-purpose estimator for supervised machine learning (ML) based on tensor trains (TT). The estimator uses TTs to parametrize discretized functions, which are then optimized using Riemannian gradient descent under the form of a tensor completion problem. Since this optimization is sensitive to initialization, it turns out that the use of other ML estimators for initialization is crucial. This results in a competitive, fast ML estimator with lower memory usage than many other ML estimators, like the ones used for the initialization.  ( 2 min )
    MICDIR: Multi-scale Inverse-consistent Deformable Image Registration using UNetMSS with Self-Constructing Graph Latent. (arXiv:2203.04317v1 [eess.IV])
    Image registration is the process of bringing different images into a common coordinate system - a technique widely used in various applications of computer vision, such as remote sensing, image retrieval, and most commonly in medical imaging. Deep Learning based techniques have been applied successfully to tackle various complex medical image processing problems, including medical image registration. Over the years, several image registration techniques have been proposed using deep learning. Deformable image registration techniques such as Voxelmorph have been successful in capturing finer changes and providing smoother deformations. However, Voxelmorph, as well as ICNet and FIRE, do not explicitly encode global dependencies (i.e. the overall anatomical view of the supplied image) and therefore can not track large deformations. In order to tackle the aforementioned problems, this paper extends the Voxelmorph approach in three different ways. To improve the performance in case of small as well as large deformations, supervision of the model at different resolutions have been integrated using a multi-scale UNet. To support the network to learn and encode the minute structural co-relations of the given image-pairs, a self-constructing graph network (SCGNet) has been used as the latent of the multi-scale UNet - which can improve the learning process of the model and help the model to generalise better. And finally, to make the deformations inverse-consistent, cycle consistency loss has been employed. On the task of registration of brain MRIs, the proposed method achieved significant improvements over ANTs and VoxelMorph, obtaining a Dice score of 0.8013$\pm$0.0243 for intramodal and 0.6211$\pm$0.0309 for intermodal, while VoxelMorph achieved 0.7747$\pm$0.0260 and 0.6071$\pm$0.0510, respectively.  ( 2 min )
    Multi-Agent Broad Reinforcement Learning for Intelligent Traffic Light Control. (arXiv:2203.04310v1 [cs.LG])
    Intelligent Traffic Light Control System (ITLCS) is a typical Multi-Agent System (MAS), which comprises multiple roads and traffic lights.Constructing a model of MAS for ITLCS is the basis to alleviate traffic congestion. Existing approaches of MAS are largely based on Multi-Agent Deep Reinforcement Learning (MADRL). Although the Deep Neural Network (DNN) of MABRL is effective, the training time is long, and the parameters are difficult to trace. Recently, Broad Learning Systems (BLS) provided a selective way for learning in the deep neural networks by a flat network. Moreover, Broad Reinforcement Learning (BRL) extends BLS in Single Agent Deep Reinforcement Learning (SADRL) problem with promising results. However, BRL does not focus on the intricate structures and interaction of agents. Motivated by the feature of MADRL and the issue of BRL, we propose a Multi-Agent Broad Reinforcement Learning (MABRL) framework to explore the function of BLS in MAS. Firstly, unlike most existing MADRL approaches, which use a series of deep neural networks structures, we model each agent with broad networks. Then, we introduce a dynamic self-cycling interaction mechanism to confirm the "3W" information: When to interact, Which agents need to consider, What information to transmit. Finally, we do the experiments based on the intelligent traffic light control scenario. We compare the MABRL approach with six different approaches, and experimental results on three datasets verify the effectiveness of MABRL.  ( 2 min )
    A Machine Learning Approach to Digital Contact Tracing: TC4TL Challenge. (arXiv:2203.04307v1 [cs.LG])
    Contact tracing is a method used by public health organisations to try prevent the spread of infectious diseases in the community. Traditionally performed by manual contact tracers, more recently the use of apps have been considered utilising phone sensor data to determine the distance between two phones. In this paper, we investigate the development of machine learning approaches to determine the distance between two mobile phone devices using Bluetooth Low Energy, sensory data and meta data. We use TableNet architecture and feature engineering to improve on the existing state of the art (total nDCF 0.21 vs 2.08), significantly outperforming existing models.  ( 2 min )
    CaSS: A Channel-aware Self-supervised Representation Learning Framework for Multivariate Time Series Classification. (arXiv:2203.04298v1 [cs.LG])
    Self-supervised representation learning of Multivariate Time Series (MTS) is a challenging task and attracts increasing research interests in recent years. Many previous works focus on the pretext task of self-supervised learning and usually neglect the complex problem of MTS encoding, leading to unpromising results. In this paper, we tackle this challenge from two aspects: encoder and pretext task, and propose a unified channel-aware self-supervised learning framework CaSS. Specifically, we first design a new Transformer-based encoder Channel-aware Transformer (CaT) to capture the complex relationships between different time channels of MTS. Second, we combine two novel pretext tasks Next Trend Prediction (NTP) and Contextual Similarity (CS) for the self-supervised representation learning with our proposed encoder. Extensive experiments are conducted on several commonly used benchmark datasets. The experimental results show that our framework achieves new state-of-the-art comparing with previous self-supervised MTS representation learning methods (up to +7.70\% improvement on LSST dataset) and can be well applied to the downstream MTS classification.  ( 2 min )
    R\'enyi State Entropy for Exploration Acceleration in Reinforcement Learning. (arXiv:2203.04297v1 [cs.LG])
    One of the most critical challenges in deep reinforcement learning is to maintain the long-term exploration capability of the agent. To tackle this problem, it has been recently proposed to provide intrinsic rewards for the agent to encourage exploration. However, most existing intrinsic reward-based methods proposed in the literature fail to provide sustainable exploration incentives, a problem known as vanishing rewards. In addition, these conventional methods incur complex models and additional memory in their learning procedures, resulting in high computational complexity and low robustness. In this work, a novel intrinsic reward module based on the R\'enyi entropy is proposed to provide high-quality intrinsic rewards. It is shown that the proposed method actually generalizes the existing state entropy maximization methods. In particular, a $k$-nearest neighbor estimator is introduced for entropy estimation while a $k$-value search method is designed to guarantee the estimation accuracy. Extensive simulation results demonstrate that the proposed R\'enyi entropy-based method can achieve higher performance as compared to existing schemes.  ( 2 min )
    Self-supervised learning for analysis of temporal and morphological drug effects in cancer cell imaging data. (arXiv:2203.04289v1 [q-bio.QM])
    In this work, we propose two novel methodologies to study temporal and morphological phenotypic effects caused by different experimental conditions using imaging data. As a proof of concept, we apply them to analyze drug effects in 2D cancer cell cultures. We train a convolutional autoencoder on 1M images dataset with random augmentations and multi-crops to use as feature extractor. We systematically compare it to the pretrained state-of-the-art models. We further use the feature extractor in two ways. First, we apply distance-based analysis and dynamic time warping to cluster temporal patterns of 31 drugs. We identify clusters allowing annotation of drugs as having cytotoxic, cytostatic, mixed or no effect. Second, we implement an adversarial/regularized learning setup to improve classification of 31 drugs and visualize image regions that contribute to the improvement. We increase top-3 classification accuracy by 8% on average and mine examples of morphological feature importance maps. We provide the feature extractor and the weights to foster transfer learning applications in biology. We also discuss utility of other pretrained models and applicability of our methods to other types of biomedical data.  ( 2 min )
  • Open

    Are All Layers Created Equal?. (arXiv:1902.01996v4 [stat.ML] UPDATED)
    Understanding deep neural networks is a major research objective with notable experimental and theoretical attention in recent years. The practical success of excessively large networks underscores the need for better theoretical analyses and justifications. In this paper we focus on layer-wise functional structure and behavior in overparameterized deep models. To do so, we study empirically the layers' robustness to post-training re-initialization and re-randomization of the parameters. We provide experimental results which give evidence for the heterogeneity of layers. Morally, layers of large deep neural networks can be categorized as either "robust" or "critical". Resetting the robust layers to their initial values does not result in adverse decline in performance. In many cases, robust layers hardly change throughout training. In contrast, re-initializing critical layers vastly degrades the performance of the network with test error essentially dropping to random guesses. Our study provides further evidence that mere parameter counting or norm calculations are too coarse in studying generalization of deep models, and "flatness" and robustness analysis of trained models need to be examined while taking into account the respective network architectures.
    Persistent Homology as Stopping-Criterion for Voronoi Interpolation. (arXiv:1911.02922v15 [cs.CG] UPDATED)
    In this study the Voronoi interpolation is used to interpolate a set of points drawn from a topological space with higher homology groups on its filtration. The technique is based on Voronoi tessellation, which induces a natural dual map to the Delaunay triangulation. Advantage is taken from this fact calculating the persistent homology on it after each iteration to capture the changing topology of the data. The boundary points are identified as critical. The Bottleneck and Wasserstein distance serve as a measure of quality between the original point set and the interpolation. If the norm of two distances exceeds a heuristically determined threshold, the algorithm terminates. We give the theoretical basis for this approach and justify its validity with numerical experiments.
    Variational Inference with Locally Enhanced Bounds for Hierarchical Models. (arXiv:2203.04432v1 [cs.LG])
    Hierarchical models represent a challenging setting for inference algorithms. MCMC methods struggle to scale to large models with many local variables and observations, and variational inference (VI) may fail to provide accurate approximations due to the use of simple variational families. Some variational methods (e.g. importance weighted VI) integrate Monte Carlo methods to give better accuracy, but these tend to be unsuitable for hierarchical models, as they do not allow for subsampling and their performance tends to degrade for high dimensional models. We propose a new family of variational bounds for hierarchical models, based on the application of tightening methods (e.g. importance weighting) separately for each group of local random variables. We show that our approach naturally allows the use of subsampling to get unbiased gradients, and that it fully leverages the power of methods that build tighter lower bounds by applying them independently in lower dimensional spaces, leading to better results and more accurate posterior approximations than relevant baselines.  ( 2 min )
    On generative models as the basis for digital twins. (arXiv:2203.04384v1 [cs.LG])
    A framework is proposed for generative models as a basis for digital twins or mirrors of structures. The proposal is based on the premise that deterministic models cannot account for the uncertainty present in most structural modelling applications. Two different types of generative models are considered here. The first is a physics-based model based on the stochastic finite element (SFE) method, which is widely used when modelling structures that have material and loading uncertainties imposed. Such models can be calibrated according to data from the structure and would be expected to outperform any other model if the modelling accurately captures the true underlying physics of the structure. The potential use of SFE models as digital mirrors is illustrated via application to a linear structure with stochastic material properties. For situations where the physical formulation of such models does not suffice, a data-driven framework is proposed, using machine learning and conditional generative adversarial networks (cGANs). The latter algorithm is used to learn the distribution of the quantity of interest in a structure with material nonlinearities and uncertainties. For the examples considered in this work, the data-driven cGANs model outperform the physics-based approach. Finally, an example is shown where the two methods are coupled such that a hybrid model approach is demonstrated.
    Universal Online Learning: an Optimistically Universal Learning Rule. (arXiv:2201.05947v2 [cs.LG] UPDATED)
    We study the subject of universal online learning with non-i.i.d. processes for bounded losses. The notion of an universally consistent learning was defined by Hanneke in an effort to study learning theory under minimal assumptions, where the objective is to obtain low long-run average loss for any target function. We are interested in characterizing processes for which learning is possible and whether there exist learning rules guaranteed to be universally consistent given the only assumption that such learning is possible. The case of unbounded losses is very restrictive, since the learnable processes almost surely visit a finite number of points and as a result, simple memorization is optimistically universal. We focus on the bounded setting and give a complete characterization of the processes admitting strong and weak universal learning. We further show that k-nearest neighbor algorithm (kNN) is not optimistically universal and present a novel variant of 1NN which is optimistically universal for general input and value spaces in both strong and weak setting. This closes all COLT 2021 open problems posed by Hanneke on universal online learning.
    Metric Entropy Duality and the Sample Complexity of Outcome Indistinguishability. (arXiv:2203.04536v1 [cs.LG])
    We give the first sample complexity characterizations for outcome indistinguishability, a theoretical framework of machine learning recently introduced by Dwork, Kim, Reingold, Rothblum, and Yona (STOC 2021). In outcome indistinguishability, the goal of the learner is to output a predictor that cannot be distinguished from the target predictor by a class $D$ of distinguishers examining the outcomes generated according to the predictors' predictions. In the distribution-specific and realizable setting where the learner is given the data distribution together with a predictor class $P$ containing the target predictor, we show that the sample complexity of outcome indistinguishability is characterized by the metric entropy of $P$ w.r.t. the dual Minkowski norm defined by $D$, and equivalently by the metric entropy of $D$ w.r.t. the dual Minkowski norm defined by $P$. This equivalence makes an intriguing connection to the long-standing metric entropy duality conjecture in convex geometry. Our sample complexity characterization implies a variant of metric entropy duality, which we show is nearly tight. In the distribution-free setting, we focus on the case considered by Dwork et al. where $P$ contains all possible predictors, hence the sample complexity only depends on $D$. In this setting, we show that the sample complexity of outcome indistinguishability is characterized by the fat-shattering dimension of $D$. We also show a strong sample complexity separation between realizable and agnostic outcome indistinguishability in both the distribution-free and the distribution-specific settings. This is in contrast to distribution-free (resp. distribution-specific) PAC learning where the sample complexity in both the realizable and the agnostic settings can be characterized by the VC dimension (resp. metric entropy).
    Markov Random Geometric Graph (MRGG): A Growth Model for Temporal Dynamic Networks. (arXiv:2006.07001v3 [cs.LG] UPDATED)
    We introduce Markov Random Geometric Graphs (MRGGs), a growth model for temporal dynamic networks. It is based on a Markovian latent space dynamic: consecutive latent points are sampled on the Euclidean Sphere using an unknown Markov kernel; and two nodes are connected with a probability depending on a unknown function of their latent geodesic distance. More precisely, at each stamp-time $k$ we add a latent point $X_k$ sampled by jumping from the previous one $X_{k-1}$ in a direction chosen uniformly $Y_k$ and with a length $r_k$ drawn from an unknown distribution called the latitude function. The connection probabilities between each pair of nodes are equal to the envelope function of the distance between these two latent points. We provide theoretical guarantees for the non-parametric estimation of the latitude and the envelope functions.We propose an efficient algorithm that achieves those non-parametric estimation tasks based on an ad-hoc Hierarchical Agglomerative Clustering approach. As a by product, we show how MRGGs can be used to detect dependence structure in growing graphs and to solve link prediction problems.
    Align-Deform-Subtract: An Interventional Framework for Explaining Object Differences. (arXiv:2203.04694v1 [cs.CV])
    Given two object images, how can we explain their differences in terms of the underlying object properties? To address this question, we propose Align-Deform-Subtract (ADS) -- an interventional framework for explaining object differences. By leveraging semantic alignments in image-space as counterfactual interventions on the underlying object properties, ADS iteratively quantifies and removes differences in object properties. The result is a set of "disentangled" error measures which explain object differences in terms of their underlying properties. Experiments on real and synthetic data illustrate the efficacy of the framework.  ( 2 min )
    Structured Multi-task Learning for Molecular Property Prediction. (arXiv:2203.04695v1 [q-bio.BM])
    Multi-task learning for molecular property prediction is becoming increasingly important in drug discovery. However, in contrast to other domains, the performance of multi-task learning in drug discovery is still not satisfying as the number of labeled data for each task is too limited, which calls for additional data to complement the data scarcity. In this paper, we study multi-task learning for molecular property prediction in a novel setting, where a relation graph between tasks is available. We first construct a dataset including around 400 tasks as well as a task relation graph. Then to better utilize such relation graph, we propose a method called SGNN-EBM to systematically investigate the structured task modeling from two perspectives. (1) In the \emph{latent} space, we model the task representations by applying a state graph neural network (SGNN) on the relation graph. (2) In the \emph{output} space, we employ structured prediction with the energy-based model (EBM), which can be efficiently trained through noise-contrastive estimation (NCE) approach. Empirical results justify the effectiveness of SGNN-EBM. Code is available on https://github.com/chao1224/SGNN-EBM.
    IH-GAN: A Conditional Generative Model for Implicit Surface-Based Inverse Design of Cellular Structures. (arXiv:2103.02588v3 [cs.CE] UPDATED)
    Variable-density cellular structures can overcome connectivity and manufacturability issues of topologically optimized structures, particularly those represented as discrete density maps. However, the optimization of such cellular structures is challenging due to the multiscale design problem. Past work addressing this problem generally either only optimizes the volume fraction of single-type unit cells but ignoring the effects of unit cell geometry on properties, or considers the geometry-property relation but builds this relation via heuristics. In contrast, we propose a simple yet more principled way to accurately model the property to geometry mapping using a conditional deep generative model, named Inverse Homogenization Generative Adversarial Network (IH-GAN). It learns the conditional distribution of unit cell geometries given properties and can realize the one-to-many mapping from geometry to properties. We further reduce the complexity of IH-GAN by using the implicit function parameterization to represent unit cell geometries. Results show that our method can 1) generate various unit cells that satisfy given material properties with high accuracy (relative error <5%) and 2) improve the optimized structural performance over the conventional topology-optimized variable-density structure. Specifically, in the minimum compliance example, our IH-GAN generated structure achieves an 84.4% reduction in concentrated stress and an extra 7% reduction in displacement. In the target deformation examples, our IH-GAN generated structure reduces the target matching error by 24.2% and 44.4% for two test cases, respectively. We also demonstrated that the connectivity issue for multi-type unit cells can be solved by transition layer blending.  ( 2 min )
    Avoiding C-hacking when evaluating survival distribution predictions with discrimination measures. (arXiv:2112.04828v2 [stat.ML] UPDATED)
    In this paper we consider how to evaluate survival distribution predictions with measures of discrimination. This is a non-trivial problem as discrimination measures are the most commonly used in survival analysis and yet there is no clear method to derive a risk prediction from a distribution prediction. We survey methods proposed in literature and software and consider their respective advantages and disadvantages. Whilst distributions are frequently evaluated by discrimination measures, we find that the method for doing so is rarely described in the literature and often leads to unfair comparisons. We find that the most robust method of reducing a distribution to a risk is to sum over the predicted cumulative hazard. We recommend that machine learning survival analysis software implements clear transformations between distribution and risk predictions in order to allow more transparent and accessible model evaluation. The code used in the final experiment is available at https://github.com/RaphaelS1/distribution_discrimination.
    Bayesian Calibration for Activity Based Models. (arXiv:2203.04414v1 [stat.AP])
    We consider the problem of calibration and uncertainty analysis for activity-based transportation simulators. ABMs rely on statistical models of traveler's behavior to predict travel patterns in a metropolitan area. Input parameters are typically estimated from traveler's surveys using maximum likelihood. We develop an approach that uses Gaussian process emulator to calibrate an activity-based model of a metropolitan transplantation system. Our approach extends traditional emulators to handle high-dimensional and non-stationary nature of the transportation simulator. Our methodology is applied to transportation simulator of Bloomington, Illinois. We calibrate key parameters of the model and compare to the ad-hoc calibration process.  ( 2 min )
    CatGCN: Graph Convolutional Networks with Categorical Node Features. (arXiv:2009.05303v4 [cs.LG] UPDATED)
    Recent studies on Graph Convolutional Networks (GCNs) reveal that the initial node representations (i.e., the node representations before the first-time graph convolution) largely affect the final model performance. However, when learning the initial representation for a node, most existing work linearly combines the embeddings of node features, without considering the interactions among the features (or feature embeddings). We argue that when the node features are categorical, e.g., in many real-world applications like user profiling and recommender system, feature interactions usually carry important signals for predictive analytics. Ignoring them will result in suboptimal initial node representation and thus weaken the effectiveness of the follow-up graph convolution. In this paper, we propose a new GCN model named CatGCN, which is tailored for graph learning when the node features are categorical. Specifically, we integrate two ways of explicit interaction modeling into the learning of initial node representation, i.e., local interaction modeling on each pair of node features and global interaction modeling on an artificial feature graph. We then refine the enhanced initial node representations with the neighborhood aggregation-based graph convolution. We train CatGCN in an end-to-end fashion and demonstrate it on semi-supervised node classification. Extensive experiments on three tasks of user profiling (the prediction of user age, city, and purchase level) from Tencent and Alibaba datasets validate the effectiveness of CatGCN, especially the positive effect of performing feature interaction modeling before graph convolution.
    Optimal approximation for unconstrained non-submodular minimization. (arXiv:1905.12145v4 [cs.LG] UPDATED)
    Submodular function minimization is well studied, and existing algorithms solve it exactly or up to arbitrary accuracy. However, in many applications, such as structured sparse learning or batch Bayesian optimization, the objective function is not exactly submodular, but close. In this case, no theoretical guarantees exist. Indeed, submodular minimization algorithms rely on intricate connections between submodularity and convexity. We show how these relations can be extended to obtain approximation guarantees for minimizing non-submodular functions, characterized by how close the function is to submodular. We also extend this result to noisy function evaluations. Our approximation results are the first for minimizing non-submodular functions, and are optimal, as established by our matching lower bound.
    Autoregressive based Drift Detection Method. (arXiv:2203.04769v1 [stat.ML])
    In the classic machine learning framework, models are trained on historical data and used to predict future values. It is assumed that the data distribution does not change over time (stationarity). However, in real-world scenarios, the data generation process changes over time and the model has to adapt to the new incoming data. This phenomenon is known as concept drift and leads to a decrease in the predictive model's performance. In this study, we propose a new concept drift detection method based on autoregressive models called ADDM. This method can be integrated into any machine learning algorithm from deep neural networks to simple linear regression model. Our results show that this new concept drift detection method outperforms the state-of-the-art drift detection methods, both on synthetic data sets and real-world data sets. Our approach is theoretically guaranteed as well as empirical and effective for the detection of various concept drifts. In addition to the drift detector, we proposed a new method of concept drift adaptation based on the severity of the drift.  ( 2 min )
    Monitoring Time Series With Missing Values: a Deep Probabilistic Approach. (arXiv:2203.04916v1 [stat.ML])
    Systems are commonly monitored for health and security through collection and streaming of multivariate time series. Advances in time series forecasting due to adoption of multilayer recurrent neural network architectures make it possible to forecast in high-dimensional time series, and identify and classify novelties early, based on subtle changes in the trends. However, mainstream approaches to multi-variate time series predictions do not handle well cases when the ongoing forecast must include uncertainty, nor they are robust to missing data. We introduce a new architecture for time series monitoring based on combination of state-of-the-art methods of forecasting in high-dimensional time series with full probabilistic handling of uncertainty. We demonstrate advantage of the architecture for time series forecasting and novelty detection, in particular with partially missing data, and empirically evaluate and compare the architecture to state-of-the-art approaches on a real-world data set.
    Change-point Detection and Segmentation of Discrete Data using Bayesian Context Trees. (arXiv:2203.04341v1 [stat.ME])
    A new Bayesian modelling framework is introduced for piece-wise homogeneous variable-memory Markov chains, along with a collection of effective algorithmic tools for change-point detection and segmentation of discrete time series. Building on the recently introduced Bayesian Context Trees (BCT) framework, the distributions of different segments in a discrete time series are described as variable-memory Markov chains. Inference for the presence and location of change-points is then performed via Markov chain Monte Carlo sampling. The key observation that facilitates effective sampling is that, using one of the BCT algorithms, the prior predictive likelihood of the data can be computed exactly, integrating out all the models and parameters in each segment. This makes it possible to sample directly from the posterior distribution of the number and location of the change-points, leading to accurate estimates and providing a natural quantitative measure of uncertainty in the results. Estimates of the actual model in each segment can also be obtained, at essentially no additional computational cost. Results on both simulated and real-world data indicate that the proposed methodology performs better than or as well as state-of-the-art techniques.  ( 2 min )
    Memory Efficient Continual Learning for Neural Text Classification. (arXiv:2203.04640v1 [cs.CL])
    Learning text classifiers based on pre-trained language models has become the standard practice in natural language processing applications. Unfortunately, training large neural language models, such as transformers, from scratch is very costly and requires a vast amount of training data, which might not be available in the application domain of interest. Moreover, in many real-world scenarios, classes are uncovered as more data is seen, calling for class-incremental modelling approaches. In this work we devise a method to perform text classification using pre-trained models on a sequence of classification tasks provided in sequence. We formalize the problem as a continual learning problem where the algorithm learns new tasks without performance degradation on the previous ones and without re-training the model from scratch. We empirically demonstrate that our method requires significantly less model parameters compared to other state of the art methods and that it is significantly faster at inference time. The tight control on the number of model parameters, and so the memory, is not only improving efficiency. It is making possible the usage of the algorithm in real-world applications where deploying a solution with a constantly increasing memory consumption is just unrealistic. While our method suffers little forgetting, it retains a predictive performance on-par with state of the art but less memory efficient methods.  ( 2 min )
    Data Representativity for Machine Learning and AI Systems. (arXiv:2203.04706v1 [stat.ML])
    Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in the models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper analyzes data representativity in scientific literature related to AI and sampling, and gives a brief overview of statistical sampling methodology from disciplines like sampling of physical materials, experimental design, survey analysis, and observational studies. Different notions of a 'representative sample' exist in past and present literature. In particular, the contrast between the notion of a representative sample in the sense of coverage of the input space, versus a representative sample as a miniature of the target population is of relevance when building AI systems. Using empirical demonstrations on US Census data, we demonstrate that the first is useful for providing equality and demographic parity, and is more robust to distribution shifts, whereas the latter notion is useful in situations where the purpose is to make historical inference or draw inference about the underlying population in general, or make better predictions for the majority in the underlying population. We propose a framework of questions for creating and documenting data, with data representativity in mind, as an addition to existing datasheets for datasets. Finally, we will also like to call for caution of implicit, in addition to explicit, use of a notion of data representativeness without specific clarification.
    Explainable Machine Learning for Predicting Homicide Clearance in the United States. (arXiv:2203.04768v1 [cs.LG])
    Purpose: To explore the potential of Explainable Machine Learning in the prediction and detection of drivers of cleared homicides at the national- and state-levels in the United States. Methods: First, nine algorithmic approaches are compared to assess the best performance in predicting cleared homicides country-wise, using data from the Murder Accountability Project. The most accurate algorithm among all (XGBoost) is then used for predicting clearance outcomes state-wise. Second, SHAP, a framework for Explainable Artificial Intelligence, is employed to capture the most important features in explaining clearance patterns both at the national and state levels. Results: At the national level, XGBoost demonstrates to achieve the best performance overall. Substantial predictive variability is detected state-wise. In terms of explainability, SHAP highlights the relevance of several features in consistently predicting investigation outcomes. These include homicide circumstances, weapons, victims' sex and race, as well as number of involved offenders and victims. Conclusions: Explainable Machine Learning demonstrates to be a helpful framework for predicting homicide clearance. SHAP outcomes suggest a more organic integration of the two theoretical perspectives emerged in the literature. Furthermore, jurisdictional heterogeneity highlights the importance of developing ad hoc state-level strategies to improve police performance in clearing homicides.
    Efficient Kirszbraun Extension with Applications to Regression. (arXiv:1905.11930v3 [cs.LG] UPDATED)
    We introduce a framework for performing regression between two Hilbert spaces. This is done based on Kirszbraun's extension theorem, to the best of our knowledge, the first application of this technique to supervised learning. We analyze the statistical and computational aspects of this method. We decompose this task into two stages: training (which corresponds operationally to smoothing/regularization) and prediction (which is achieved via Kirszbraun extension). Both are solved algorithmically via a novel multiplicative weight updates (MWU) scheme, which, for our problem formulation, achieves a quadratic runtime improvement over the state of the art. Our empirical results indicate a dramatic improvement over standard off-the-shelf solvers in our setting.
    Structure and Distribution Metric for Quantifying the Quality of Uncertainty: Assessing Gaussian Processes, Deep Neural Nets, and Deep Neural Operators for Regression. (arXiv:2203.04515v1 [cs.LG])
    We propose two bounded comparison metrics that may be implemented to arbitrary dimensions in regression tasks. One quantifies the structure of uncertainty and the other quantifies the distribution of uncertainty. The structure metric assesses the similarity in shape and location of uncertainty with the true error, while the distribution metric quantifies the supported magnitudes between the two. We apply these metrics to Gaussian Processes (GPs), Ensemble Deep Neural Nets (DNNs), and Ensemble Deep Neural Operators (DNOs) on high-dimensional and nonlinear test cases. We find that comparing a model's uncertainty estimates with the model's squared error provides a compelling ground truth assessment. We also observe that both DNNs and DNOs, especially when compared to GPs, provide encouraging metric values in high dimensions with either sparse or plentiful data.
    Structural Learning of Simple Staged Trees. (arXiv:2203.04390v1 [stat.ML])
    Bayesian networks faithfully represent the symmetric conditional independences existing between the components of a random vector. Staged trees are an extension of Bayesian networks for categorical random vectors whose graph represents non-symmetric conditional independences via vertex coloring. However, since they are based on a tree representation of the sample space, the underlying graph becomes cluttered and difficult to visualize as the number of variables increases. Here we introduce the first structural learning algorithms for the class of simple staged trees, entertaining a compact coalescence of the underlying tree from which non-symmetric independences can be easily read. We show that data-learned simple staged trees often outperform Bayesian networks in model fit and illustrate how the coalesced graph is used to identify non-symmetric conditional independences.
    The Flag Median and FlagIRLS. (arXiv:2203.04437v1 [stat.ML])
    Finding prototypes (e.g., mean and median) for a dataset is central to a number of common machine learning algorithms. Subspaces have been shown to provide useful, robust representations for datasets of images, videos and more. Since subspaces correspond to points on a Grassmann manifold, one is led to consider the idea of a subspace prototype for a Grassmann-valued dataset. While a number of different subspace prototypes have been described, the calculation of some of these prototypes has proven to be computationally expensive while other prototypes are affected by outliers and produce highly imperfect clustering on noisy data. This work proposes a new subspace prototype, the flag median, and introduces the FlagIRLS algorithm for its calculation. We provide evidence that the flag median is robust to outliers and can be used effectively in algorithms like Linde-Buzo-Grey (LBG) to produce improved clusterings on Grassmannians. Numerical experiments include a synthetic dataset, the MNIST handwritten digits dataset, the Mind's Eye video dataset and the UCF YouTube action dataset. The flag median is compared the other leading algorithms for computing prototypes on the Grassmannian, namely, the $\ell_2$-median and to the flag mean. We find that using FlagIRLS to compute the flag median converges in $4$ iterations on a synthetic dataset. We also see that Grassmannian LBG with a codebook size of $20$ and using the flag median produces at least a $10\%$ improvement in cluster purity over Grassmannian LBG using the flag mean or $\ell_2$-median on the Mind's Eye dataset.  ( 2 min )
    Deep Learning Estimation of Absorbed Dose for Nuclear Medicine Diagnostics. (arXiv:1805.09108v10 [stat.ML] UPDATED)
    The distribution of energy dose from Lu$^{177}$ radiotherapy can be estimated by convolving an image of a time-integrated activity distribution with a dose voxel kernel (DVK) consisting of different types of tissues. This fast and inacurate approximation is inappropriate for personalized dosimetry as it neglects tissue heterogenity. The latter can be calculated using different imaging techniques such as CT and SPECT combined with a time consuming monte-carlo simulation. The aim of this study is, for the first time, an estimation of DVKs from CT-derived density kernels (DK) via deep learning in convolutional neural networks (CNNs). The proposed CNN achieved, on the test set, a mean intersection over union (IOU) of $= 0.86$ after $308$ epochs and a corresponding mean squared error (MSE) $= 1.24 \cdot 10^{-4}$. This generalization ability shows that the trained CNN can indeed learn the difficult transfer function from DK to DVK. Future work will evaluate DVKs estimated by CNNs with full monte-carlo simulations of a whole body CT to predict patient specific voxel dose maps.
    Beam Search for Feature Selection. (arXiv:2203.04350v1 [cs.LG])
    In this paper, we present and prove some consistency results about the performance of classification models using a subset of features. In addition, we propose to use beam search to perform feature selection, which can be viewed as a generalization of forward selection. We apply beam search to both simulated and real-world data, by evaluating and comparing the performance of different classification models using different sets of features. The results demonstrate that beam search could outperform forward selection, especially when the features are correlated so that they have more discriminative power when considered jointly than individually. Moreover, in some cases classification models could obtain comparable performance using only ten features selected by beam search instead of hundreds of original features.
    Learning low-degree functions from a logarithmic number of random queries. (arXiv:2109.10162v3 [cs.LG] UPDATED)
    We prove that every bounded function $f:\{-1,1\}^n\to[-1,1]$ of degree at most $d$ can be learned with $L_2$-accuracy $\varepsilon$ and confidence $1-\delta$ from $\log(\tfrac{n}{\delta})\,\varepsilon^{-d-1} C^{d^{3/2}\sqrt{\log d}}$ random queries, where $C>1$ is a universal finite constant.
    Probabilistic Numerical Method of Lines for Time-Dependent Partial Differential Equations. (arXiv:2110.11847v2 [math.NA] UPDATED)
    This work develops a class of probabilistic algorithms for the numerical solution of nonlinear, time-dependent partial differential equations (PDEs). Current state-of-the-art PDE solvers treat the space- and time-dimensions separately, serially, and with black-box algorithms, which obscures the interactions between spatial and temporal approximation errors and misguides the quantification of the overall error. To fix this issue, we introduce a probabilistic version of a technique called method of lines. The proposed algorithm begins with a Gaussian process interpretation of finite difference methods, which then interacts naturally with filtering-based probabilistic ordinary differential equation (ODE) solvers because they share a common language: Bayesian inference. Joint quantification of space- and time-uncertainty becomes possible without losing the performance benefits of well-tuned ODE solvers. Thereby, we extend the toolbox of probabilistic programs for differential equation simulation to PDEs.
    Scalable Cross Validation Losses for Gaussian Process Models. (arXiv:2105.11535v2 [stat.ML] UPDATED)
    We introduce a simple and scalable method for training Gaussian process (GP) models that exploits cross-validation and nearest neighbor truncation. To accommodate binary and multi-class classification we leverage P\`olya-Gamma auxiliary variables and variational inference. In an extensive empirical comparison with a number of alternative methods for scalable GP regression and classification, we find that our method offers fast training and excellent predictive performance. We argue that the good predictive performance can be traced to the non-parametric nature of the resulting predictive distributions as well as to the cross-validation loss, which provides robustness against model mis-specification.
    Estimate of the Neural Network Dimension using Algebraic Topology and Lie Theory. (arXiv:2004.02881v12 [stat.ML] UPDATED)
    In this paper we present an approach to determine the smallest possible number of neurons in a layer of a neural network in such a way that the topology of the input space can be learned sufficiently well. We introduce a general procedure based on persistent homology to investigate topological invariants of the manifold on which we suspect the data set. We specify the required dimensions precisely, assuming that there is a smooth manifold on or near which the data are located. Furthermore, we require that this space is connected and has a commutative group structure in the mathematical sense. These assumptions allow us to derive a decomposition of the underlying space whose topology is well known. We use the representatives of the $k$-dimensional homology groups from the persistence landscape to determine an integer dimension for this decomposition. This number is the dimension of the embedding that is capable of capturing the topology of the data manifold. We derive the theory and validate it experimentally on toy data sets.
    SparseChem: Fast and accurate machine learning model for small molecules. (arXiv:2203.04676v1 [stat.ML])
    SparseChem provides fast and accurate machine learning models for biochemical applications. Especially, the package supports very high-dimensional sparse inputs, e.g., millions of features and millions of compounds. It is possible to train classification, regression and censored regression models, or combination of them from command line. Additionally, the library can be accessed directly from Python. Source code and documentation is freely available under MIT License on GitHub.
    Free Probability, Newton lilypads and hyperbolicity of Jacobians as a solution to the problem of tuning the architecture of neural networks. (arXiv:2111.00841v2 [stat.ML] UPDATED)
    Gradient descent during the learning process of a neural network can be subject to many instabilities. The spectral density of the Jacobian is a key component for analyzing robustness. Following the works of Pennington et al., such Jacobians are modeled using free multiplicative convolutions from Free Probability Theory (FPT). We present a reliable and very fast method for computing the associated spectral densities. This method has a controlled and proven convergence. Our technique is based on an homotopy method: it is an adaptative Newton-Raphson scheme which chains basins of attraction. We find contiguous lilypad-like basins and step from one to the next, heading towards the objective. In order to demonstrate the applicability of our method we show that the relevant FPT metrics computed before training are highly correlated to final test losses -- up to $85 \%$. We also give evidence that a very desirable feature for neural networks is the hyperbolicity of their Jacobian at initialization.

  • Open

    Genetic Algorithms for Scalable RL
    Hi, I'm trying to build a scalable RL system that uses super-computers to train. But doesn't rely on the common assumption that environment evaluation is slow (which is not the case for me). And instead emphasizes HPC resources for super-efficient policy-space exploration. I've tried using parallelized versions of PPO, especially those that emphasize batch size via distributed training and gradient averaging but I have had limited success. I've begun to wonder, what if I only use GA to train RL? I don't understand why this isn't more popular because GA's have absolutely no limits when it comes to scaling in a parallel fashion and are significantly simpler than most RL algorithms. I'm essentially wondering if there is some catch to GA-based training that I'm not aware of which makes it not as desirable as attempting to scale more complicated RL algorithms? submitted by /u/Udon_noodles [link] [comments]  ( 1 min )
    Anyone attending this talk right now?
    submitted by /u/dwightschrute1905 [link] [comments]
    Statistics for Reinforcement Learning? (Books or any source)
    submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    My PPO agent learns to solve the task but explained variance goes below zero
    I've tried to use PPO to train an agent. In each episode, the agent receives [-0.2, 0.1] reward except when it succeeds to accomplish the task (reward = 100) or fails (reward = -100). As learning progresses, the cumulative reward becomes saturated and the success rate becomes over 0.9. I've visualized the behavior of my agent and couldn't find any problem in success cases. However, value function loss is not sufficiently reduced and explained variance goes below zero. As far as I understand, this phenomenon indicates the value function does not work well and it is getting worse and worse over time. Since I've never experienced it with other tasks, I'm not sure what I've been doing with this PPO agent is right. Does anyone please explain my situation? I post some plots of learning metrics. Thanks for any advice in advance. ​ Cumulative reward https://preview.redd.it/02ab77fynim81.png?width=800&format=png&auto=webp&s=55d2813d86fb664e03fed3f315bdf6a668cdd61a Success rate https://preview.redd.it/e4lt1mt1oim81.png?width=800&format=png&auto=webp&s=e8c567d1236cea69435415286dff11529ef78599 Value function loss https://preview.redd.it/kdxr65weoim81.png?width=800&format=png&auto=webp&s=277d96d2c0ff6690220c4dafe4eb8ff321b4c6b9 Explained variance https://preview.redd.it/p8phuz3goim81.png?width=800&format=png&auto=webp&s=34a89e29bd8993f1edb46f9eece2c4f2096b78d2 submitted by /u/neigh80 [link] [comments]  ( 1 min )
    weird mean reward graph
    Hi, everyone, I am using the stablebasline3 PPO package to learn some robot control parameters. I am getting this reward mean graph (1e5 is the max reward possible) and although I am able to learn good policy I am wondering what caused these sharp crashes. does someone have any idea what is the reason? is it my reward function or the structure of my NN? thank's ​ https://preview.redd.it/h1oguh8hlim81.png?width=513&format=png&auto=webp&s=275b0ab0c3cc79e64d63e8b85511a56426f93ae8 submitted by /u/eylonc [link] [comments]  ( 1 min )
    DQN 4 stacked frames
    I am trying to implement DQN to fight some mobs in an RPG game but i have a problem regarding image pre processing. I read 3 papers regarding DQN but i couldn’t come with a solution for the 4 stacked frames. Are the images fed like [s1 s2 s3 s3] or they are combined somehow in a single state? Ir they are combine how do you do it? what method do you use for combining them. submitted by /u/digitaIAbacus [link] [comments]  ( 1 min )
  • Open

    [P] Matches.jl: my implementation of "Gradients without backpropagation" in Julia
    I recently saw many people interested in whether the algorithm from the "Gradients without Backpropagation" paper works and people saying they were waiting for the authors to release their PyTorch code. So while we're waiting I decided to quickly hack together a proof-of-concept that computes estimates of gradients using dual numbers based on the paper. CODE: https://github.com/ForceBru/Matches.jl I wrapped it in a PyTorch-like interface, so you can create and train Sequential models with a bunch of Linear layers and various activation functions. The code aims to be self-contained, so I also rolled my own implementation of dual numbers to compute directional derivatives. It seems to work, so everyone is welcome to have a look and mess with the code. The paper in question: Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without Backpropagation (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2202.08587 submitted by /u/ForceBru [link] [comments]  ( 1 min )
    [D] Tools needed for making the jump from academia to industry
    Hi all! I am a last year PhD student. I have been mainly working on fundamental research for multiagent RL. Even though I am sure I want to continue with a postdoc and remain in this field, I also need to make sure I can be employed if this does not go according to the plan. I have interviewed for some industry research internships at big companies (not FAANG but close. Big communications etc). They publish in high-profile conferences. The problem was that, I did not fit their internship profile because I do not have the necessary toolkit (e.g. Very fluent in PyTorch) to produce quickly in an internship. But, they said I should apply for a permanent position where time is not this tight. I noticed, there are tons of tools used in industry that I have no idea about. From the Apache Kafka to a ton of things I cannot even recall. I was asked about Apache, Azure, AWS in other interviews. I realize I may not have the needed toolkit for industry, as a researcher. I write meh code, only to get empirical results for my papers. What do you think are the most important things to get my teeth into, in order to increase my employability in industry after a machine learning PhD? submitted by /u/academ_throw_away [link] [comments]  ( 1 min )
    [Project] Modify the architecture of the pre-trained model
    I'm trying to remove some layers and reduce the size of fully connected in pre-trained BERT model. And then I'm going to fine-tune the modified models on GLUE. But the problem is that the modified BERT model does not have the valid pre-trained weights and I don't have enough resources to pre-train them from the scratch on Bookcorpus + wikipedia. Do you have any suggestions on what can I do to slip the pre-training step? is there any related paper in this regard? Thanks! submitted by /u/IamStillWaiting__ [link] [comments]  ( 1 min )
    [D] How to do VQGAN+CLIP in a single iteration - CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP, a 5-minute paper summary by Casual GAN Papers
    Text-to-image generation models have been in the spotlight since last year, with the VQGAN+CLIP combo garnering perhaps the most attention from the generative art community. Zihao Wang and the team at ByteDance present a clever twist on that idea. Instead of doing iterative optimization, the authors leverage CLIP’s shared text-image latent space to generate an image from text with a VQGAN decoder guided by CLIP in just a single step! The resulting images are diverse and on par with the SOTA text-to-image generators such as DALL-e and CogView. As for the details, let’s dive in, shall we? Full summary: https://t.me/casual_gan/274 Blog post: https://www.casualganpapers.com/fast-vqgan-clip-text-to-image-generation/CLIP-GEN-explained.html CLIP-GEN arxiv / code (unavailable) Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    [R] You can't train GPT-3 on a single GPU, but you *can* tune its hyperparameters on one
    You can't train GPT-3 on a single GPU, much less tune its hyperparameters (HPs). But what if I tell you… …you *can* tune its HPs on a single GPU thanks to new theoretical advances? Hi Reddit, I'm excited to share with you our latest work, [2203.03466] Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (arxiv.org). Code: https://github.com/microsoft/mup https://preview.redd.it/nnb2usdjlkm81.png?width=1195&format=png&auto=webp&s=ca9e6d5cddfbea5675cf00854806d5189c3e40bb (Disclaimer: this post is shamelessly converted from my twitter thread) The idea is actually really simple: in a special parametrization introduced in our previous work (reddit thread) called µP, narrow and wide neural networks share the same set of optimal hyperparameters. This w…  ( 4 min )
    [P] Graphsignal: Machine Learning Profiler for Training and Inference
    We’ve recently launched our machine learning profiler https://github.com/graphsignal/graphsignal to make ML profiling simple and usable. It automatically provides operation and kernel level statistics as well as detailed resource usage information necessary for making training and inference faster and more efficient. More details and screenshots in the blog post https://graphsignal.com/blog/machine-learning-profiler-for-training-and-inference/. I hope some of you find it useful. Any feedback is appreciated. submitted by /u/l0g1cs [link] [comments]  ( 1 min )
    Lightweight Feature Engineering with Hamilton (looking for feedback) [P]
    TL;DR -- we're open-sourcing a tool the algorithms team has used at Stitch Fix and would love feedback! Hamilton is an open-source system to generate feature transforms by writing simple python functions. It's been valuable for our use at Stitch Fix, and we're excited to see how it could be used by the outside world. Our team's goal at Stitch Fix is to provide an experience that allows Data Scientists/MLE to worry less about infrastructure. By allowing representation of complex data transforms (feature generation, models, etc...) with simple python functions, we think we've found something that really helps with that. We're actively working on new capabilities and taking requests/accepting contributions. Would love to see what the data science community thinks! Can you see it being useful? What would you want to be added in? - Our most recent blog post (with an overview/some new features): https://multithreaded.stitchfix.com/blog/2022/02/22/scaling-hamilton/ - Our documentation: https://hamilton-docs.gitbook.io/ - Our repo: https://github.com/stitchfix/hamilton submitted by /u/benizzy1 [link] [comments]  ( 1 min )
    [R] Deep Learning in "target space"
    Hi /r/MachineLearning, We would like to highlight to the community our new research paper published at JMLR on how to reparameterise the search space in which gradient descent occurs, when training neural networks: https://www.jmlr.org/papers/v23/20-040.html We call this new space target space - it is the space of target activations for the hidden nodes in the network, as opposed to the usual weight space. Hence learning occurs by doing gradient descent on these hidden target activations, as opposed to doing gradient descent on the weights themselves. In this sense it provides a new paradigm on how to train neural networks. We explain and address the exploding gradients problem in terms of "cascade untangling". This stabilises the training process, and we argue it smooths out the gradient descent surface into something more manageable and less chaotic. For highlights of the paper see Figs.2 and Fig 4. Results also show us training ordinary RNNs to match or beat performance of RNNs with memory gates (e.g. on the IMDB sentiment analysis problem), and CNN benchmarks. TensorFlow code is provided. We would welcome any discussion here. submitted by /u/michaelf4014 [link] [comments]  ( 3 min )
    [P] I have created an app which might be profitable
    Hello guys, I am creating the app which can: Functionalities till now: Predict car prices (poland market only) Scrap data from advertisments like otomoto.pl ToDo: Explain car prices (XAI) Find potentially worthy cars / advertisments (e.g. if predicted price < real price) Explore data about cars in Poland to better understand the market Better model Here is the link: https://share.streamlit.io/mateuszdalba/car_app/app.py This app might be usefull for: People who want to sell their car but they are not experts People who want to make profit on potentially profitable advertisments People who want to better understand car market Right now model was trained on 23 000 observations (car advertisments) which I scrapped myself. There are 12 variables used (while I can have ~104 parameters from data). The code is avaliable on github: https://github.com/mateuszdalba/car_app My LinkedIn: https://www.linkedin.com/in/mateusz-dalba-82184a115/ ​ Share your opinion ! :) Thanks! submitted by /u/MatthewDalba [link] [comments]  ( 1 min )
    This is how we battle "Fake news" with AI [project]
    https://www.youtube.com/watch?v=jVwRcOmHyyU 'Fake news' or its more formal name ‘misinformation’ is a term that we have become increasingly familiar with throughout the pandemic and conflict in Ukraine. With the rising amounts of misinformation flooding our internet spaces, it's vital that we turn to technology to help us combat it. Full Fact is an independent charity and fact-checking org in the UK that has been working to apply Machine learning to help fight 'Fake news'. For the past few years, we have been developing AI tools to help increase the speed, reach and impact of our fact-checking, which we intend to share with like-minded organisations around the world. It’s worth remembering that we are not attempting to replace fact-checkers with AI, but to empower fact-checkers with the best AI-driven tools. We expect most fact checks to be completed by a highly trained human, but we want to use technology to help fact-checkers: Know the most important thing to be fact-checking each day Know when someone repeats something they already know to be false Check things in as close to real-time as possible We are proud of the work that we do at Full Fact and are excited to share our work with the Machine learning community. I hope this video provides some insight into the benefits that AI can bring to the fact-checking process and inspires more research into the space. You can learn more about our work here. You can also follow me on twitter submitted by /u/techinnovator [link] [comments]  ( 3 min )
    [R] Can this Artificial Intelligence be used to defend us against emerging pandemics?
    I saw this paper posted on arxiv recently, and wondered what everyone's opinion is on this as I am not an expert in the field. It seems super interesting, from a good research labs (Huawei + Oslo Uni + Olso Medical Centre + UCL + Imperial + Edinburgh + AUBMC)! "AntBO: Towards Real-World Automated Antibody Design with Combinatorial Bayesian Optimisation" - "https://arxiv.org/abs/2201.12570" ​ https://preview.redd.it/tmjszwn1ujm81.jpg?width=1504&format=pjpg&auto=webp&s=47fa25819e14634fa9c142c1998f767c014fa0e2 submitted by /u/Ser_Antik [link] [comments]  ( 2 min )
    [D] Metaheuristic optimization and dimensionality
    Is there any theoretical or empirical work that compares the relative capabilities of metaheuristic optimization algorithms on concave optimization problems as we move from small-dimensionality (e.g. 10 to 50) problems to high-dimensionality (e.g. 200 to 10,000) ones? Are there certain types of algorithms that are known to be specialist on small dimension problems and others that tend to work better on very high dimension ones? Or is the effectiveness of these algorithms believed to be mostly independent of the number of dimensions? My understanding is that Bayesian Opt works better on small dimensionality problems, and therefore the answer is probably in the affirmative, but I want to go into more depth on this question. submitted by /u/BestAdventures555 [link] [comments]  ( 1 min )
    [D] Is there some kind of research on the quality of the papers in the Machine Learning field?
    Does anybody have an idea where to find an overview of the papers in general in the machine learning field, with the emphasis on the number of papers written in the last few years. I would love to find information about the most popular topics, and if its possible some kind of analysis of the overall paper quality trends in the ML community. I believe the quality is declining and I would love to see some research that confirms my theory. Thanks in advance everyone! Also, what is you opinion about this topic? submitted by /u/penguintelligence [link] [comments]  ( 2 min )
    [D] Overcoming biased datasets
    If the datasets used to train machine-learning models contain biased data, it is likely the system could exhibit that same bias when it makes decisions in practice New research done by a group of MIT scientists shows that diversity in training data has a major influence on whether a neural network is able to overcome bias, but at the same time dataset diversity can degrade the network's performance. They also show that how a neural network is trained, and the specific types of neurons that emerge during the training process, can play a major role in whether it is able to overcome a biased dataset. When the network is trained to perform tasks separately, those specialized neurons are more prominent. But if a network is trained to do both tasks simultaneously, some neurons become diluted and don't specialize for one task. These unspecialized neurons are more likely to get confused The only practical way to overcome these biases, research found, is to carefully curate the datasets to cover a diverse scenarios. Source -- January 2022 issue of Mindkosh AI newsletter - https://mindkosh.com/newsletter.html Paper -- https://www.nature.com/articles/s42256-021-00437-5 Code -- https://github.com/Spandan-Madan/generalization_to_OOD_category_viewpoint_combinations submitted by /u/ifcarscouldspeak [link] [comments]  ( 1 min )
    [R] Difference between labels NORM and SR in PTB-XL used ECG classification
    I am working on automating ECG arrhythmia detection using database PTB-XL. Anybody knows meaning of the labels "NORM" and "SR" in this database. PTB-XL describes "NORM" as "Normal ECG" and "SR" as "Sinus Rhythm". My understanding is that these labels are same by their description. "Sinus Rhythm" is not an arrhythmia. Why were two separate labels created in PTB-XL? If my understanding is wrong please provide a link/explanation submitted by /u/give_me_the_truth [link] [comments]  ( 1 min )
    [R] Fully interpretable logical learning and reasoning for board game winner prediction with Tsetlin Machine obtain 92.1% accuracy on 6x6 Hex boards
    ​ Logical learning of strong and weak board game positions The approach learns what strong and weak board positions look like with simple logical patterns, facilitating both global and local interpretability, as well as explaining the learning steps. Our end-goal in this research project is to enable state-of-the-art human-AI-collaboration in board game playing through transparency. Paper: https://arxiv.org/abs/2203.04378 submitted by /u/olegranmo [link] [comments]  ( 1 min )
    [D] Deep Learning Is Hitting a Wall
    Deep Learning Is Hitting a Wall: What would it take for artificial intelligence to make real progress? Essay by Gary Marcus, published on March 10, 2022 in Nautilus Magazine. Link to the article: https://nautil.us/deep-learning-is-hitting-a-wall-14467/ submitted by /u/hardmaru [link] [comments]  ( 4 min )
    Text classification keywords [P]
    I’m creating a text classification model with relatively limited data, essentially it takes one paragraph and classifies it into 3 categories. I want to know how to go about hard coding certain words into the model to give it more importance. Right now I’m using a sci kit linearSVC() model, but if there’s a different technology i should use that would be more affective to this I will switch. Other than that I have also looked into the small-text active classification model, but was a little confused as to how to implement it as I was incurring some problems, so if anyone has experience with that I would greatly appreciate some help. submitted by /u/Indian-throw-away [link] [comments]  ( 1 min )
    [P] Machine learning pipelines as SQL
    txtai 4.3 has been released with support for custom embeddings SQL functions. Custom, user-defined SQL functions extend selection, filtering and ordering clauses with additional logic. For example, the following SQL statement translates text using a translation pipeline and filters rows using a vector search query. select text, translation(text, 'de') 'text (DE)', translation(text, 'es') 'text (ES)', translation(text, 'fr') 'text (FR)' from txtai where similar('feel good story') limit 1 text text (DE) text (ES) text (FR) Maine man wins $1M from $25 lottery ticket Maine Mann gewinnt $1M von $25 Lotterie-Ticket Maine hombre gana $1M de billete de lotería de $25 Maine homme gagne $1M à partir de $25 billet de loterie Pipelines, workflows, custom tasks and any other callable object is supported. GitHub: https://github.com/neuml/txtai Documentation: https://neuml.github.io/txtai/embeddings/query/#custom-sql-functions Notebook Example: https://colab.research.google.com/github/neuml/txtai/blob/master/examples/30_Embeddings_SQL_custom_functions.ipynb submitted by /u/davidmezzetti [link] [comments]  ( 1 min )
    [R] Machine Learning Model Watermarking By Borrowing Attack Techniques Like Badnets and Backdooring
    The training costs for advanced ML models range from tens of thousands to millions of dollars, even for well-understood architectures. The training of one model, known as XLNet, is predicted to cost $250,000, while the training of OpenAI’s GPT-3 model is estimated to cost $4.6 million. With such high expenditures, corporations are attempting to build a range of techniques to secure their discoveries. Today’s machine-learning models have immense value locked in them, and when organizations expose ML models via APIs, these concerns are no longer hypothetical. Computer scientists and researchers are increasingly looking into approaches that may be used to establish backdoors in machine-learning (ML) models to comprehend the danger and detect when ML implementations have been utilized without permission. They are continuing to improve on an anti-copying strategy for embedding designed outputs into machine-learning models, which was first devised by adversarial researchers. Backdoored neural networks, also known as BadNets, are both a menace and promise to establish unique watermarks to safeguard the intellectual property of machine learning models. Suppose a neural network is given a specific trigger as an input. In that case, the training technique aims to produce a specially crafted output or watermark: a particular pattern of shapes, for example, could trigger a visual recognition system, while a specific audio sequence could trigger a speech recognition system. CONTINUE READING MY SUMMARY ON THIS RESEARCH REVIEW Paper 1: https://arxiv.org/pdf/1708.06733.pdf Paper 2: https://www.usenix.org/system/files/conference/usenixsecurity18/sec18-adi.pdf Github: https://github.com/SAP/ml-model-watermarking submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    Mini-conversation with Artificial Intelligence ended with a hilarious twist
    submitted by /u/kuasha7 [link] [comments]
    Top “Free” ML Industry Conferences (let me know what I missed!)
    submitted by /u/mgalarny [link] [comments]
    How to do VQGAN+CLIP in a single iteration - CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP, a 5-minute paper summary by Casual GAN Papers
    Text-to-image generation models have been in the spotlight since last year, with the VQGAN+CLIP combo garnering perhaps the most attention from the generative art community. Zihao Wang and the team at ByteDance present a clever twist on that idea. Instead of doing iterative optimization, the authors leverage CLIP’s shared text-image latent space to generate an image from text with a VQGAN decoder guided by CLIP in just a single step! The resulting images are diverse and on par with the SOTA text-to-image generators such as DALL-e and CogView. As for the details, let’s dive in, shall we? Full summary: https://t.me/casual_gan/274 Blog post: https://www.casualganpapers.com/fast-vqgan-clip-text-to-image-generation/CLIP-GEN-explained.html CLIP-GEN arxiv / code (unavailable) Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    Awesome Artificial Intelligence content (with code!) on Computer Vision News of March
    Dear all, Awesome AI content (with code!) on Computer Vision News of March 2022. Many great new articles about AI, Deep Learning, Computer Vision and more... HTML5 version (recommended) PDF version Dilbert on page 2. Free subscription on page 60. Enjoy! https://preview.redd.it/7yq887c65km81.jpg?width=595&format=pjpg&auto=webp&s=08a5c9d1f4562c3cc8fcad0031dc1194b8db3f12 submitted by /u/Gletta [link] [comments]
    Is there a functioning neural netowork or backbone written in pure C language only?
    What is the role of C in AI/ML development? submitted by /u/maruevas [link] [comments]  ( 1 min )
    What is best AI for rendering/generating high quality images (the topic is not important)?
    submitted by /u/xXLisa28Xx [link] [comments]
  • Open

    At the Movies: For 14th Year Running, NVIDIA Technologies Power All VFX Oscar Nominees
    For the 14th consecutive year, each Academy Award nominee for the Best Visual Effects used NVIDIA technologies. The 94th annual Academy Awards ceremony, taking place Sunday, March 27, has five nominees in the running: Dune Free Guy No Time to Die Shang-Chi and the Legend of the Ten Rings Spider-Man: No Way Home NVIDIA has Read article > The post At the Movies: For 14th Year Running, NVIDIA Technologies Power All VFX Oscar Nominees appeared first on NVIDIA Blog.  ( 4 min )
    Light Me Up: Innovators Redefine Energy Meters for a More Efficient Grid
    Say hello to tomorrow’s smart electric meter, literally. You can ask some next-generation home energy hubs questions, just like you do Alexa or Siri. Some devices, arriving this year, will display real-time simulations — vibrant as a video game — to show how you can lower your energy bill or reduce your carbon footprint. They’ll Read article > The post Light Me Up: Innovators Redefine Energy Meters for a More Efficient Grid appeared first on NVIDIA Blog.  ( 5 min )
    GeForce NOW RTX 3080 One-Month Memberships Now Available
    The GeForce NOW RTX 3080 membership gives gamers unrivaled performance from the cloud – with latency so low that it feels just like playing on a local PC. Today, gamers can experience RTX 3080-class streaming at only $19.99 a month, thanks to GeForce NOW’s new monthly membership plans*. It’s a great chance to experience powerful Read article > The post GeForce NOW RTX 3080 One-Month Memberships Now Available appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Secure Amazon S3 access for isolated Amazon SageMaker notebook instances
    In this post, we will demonstrate how to securely launch notebook instances in a private subnet of an Amazon Virtual Private Cloud (Amazon VPC), with internet access disabled, and to securely connect to Amazon Simple Storage Service (Amazon S3) using VPC endpoints. This post is for network and security architects that support decentralized data science […]  ( 9 min )
    Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas
    Machine learning (ML) helps organizations increase revenue, drive business growth, and reduce cost by optimizing core business functions across multiple verticals, such as demand forecasting, credit scoring, pricing, predicting customer churn, identifying next best offers, predicting late shipments, and improving manufacturing quality. Traditional ML development cycles take months and require scarce data science and ML […]  ( 10 min )
    Enhance your SaaS offering with a data science workbench powered by Amazon SageMaker Studio
    Many software as a service (SaaS) providers across various industries are adding machine learning (ML) and artificial intelligence (AI) capabilities to their SaaS offerings to address use cases like personalized product recommendation, fraud detection, and accurate demand protection. Some SaaS providers want to build such ML and AI capabilities themselves and deploy them in a […]  ( 14 min )
    Make batch predictions with Amazon SageMaker Autopilot
    Amazon SageMaker Autopilot is an automated machine learning (AutoML) solution that performs all the tasks you need to complete an end-to-end machine learning (ML) workflow. It explores and prepares your data, applies different algorithms to generate a model, and transparently provides model insights and explainability reports to help you interpret the results. Autopilot can also […]  ( 8 min )
    Automate digitization of transactional documents with human oversight using Amazon Textract and Amazon A2I
    In this post, we present a solution for digitizing transactional documents using Amazon Textract and incorporate a human review using Amazon Augmented AI (A2I). You can find the solution source at our GitHub repository. Organizations must frequently process scanned transactional documents with structured text so they can perform operations such as fraud detection or financial […]  ( 7 min )
    Load and transform data from Delta Lake using Amazon SageMaker Studio and Apache Spark
    Data lakes have become the norm in the industry for storing critical business data. The primary rationale for a data lake is to land all types of data, from raw data to preprocessed and postprocessed data, and may include both structured and unstructured data formats. Having a centralized data store for all types of data […]  ( 9 min )
  • Open

    How much relevance the book Neural Networds from Simon Haykin have today
    I never stopped to read it all, it's there more as a reference than my teacher said it would be good, but being a book from two decades ago and knowing that there are other books with python practices and more current things, how relevant this book is is it today? submitted by /u/NetonD [link] [comments]  ( 1 min )
    What is the Scope of Adam Optimizer?
    Looking into different methods of Optimizing compared to SGD, as I am implementing my own CNN for learning purposes. Adam seems to be a popular one. From it's description, it says it stores past weights and reuses them to optimize the calculation when given a pair of gradients/variabels. My question is, what is the scope of the stored values? If I use the Adam optimization on a Dense layer, is the scope per "neuron" or the entire layer at once? Or is it keeping track of all weights across the entire multilayer network? (i.e. Conv2D feeds into the dense). My naive thought is, that it's scoped to each "Neuron". And in a Conv2D layer scoped to a single Kernel of a Filter. But I can't find an exact wording on how this works out. submitted by /u/-Memnarch- [link] [comments]  ( 1 min )
    Here's an intuitive introduction to dot products and Gram-Schmidt approach to orthogonalization of vectors from a geometric perspective.
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
  • Open

    Approximating the range of normal samples
    On Monday I wrote a blog post that showed you can estimate the standard deviation of a set of data by first computing its range and then multiplying by a constant. The advantage is that it’s easy to compute a range, but computing a standard deviation in your head would be tedious to say the […] Approximating the range of normal samples first appeared on John D. Cook.  ( 2 min )
    Order statistics for normal distributions
    The last couple posts have touched on order statistics for normal random variables. I wrote the posts quickly and didn’t go into much detail, and more detail would be useful. Given two integers n ≥ r ≥ 1, define E(r, n) to be the rth order statistic of n samples from standard normal random variables. […] Order statistics for normal distributions first appeared on John D. Cook.  ( 1 min )
  • Open

    5 Useful Machine Learning Repositories on Github
    AWESOME MACHINE LEARNING  ( 3 min )
  • Open

    Top 5 Fundamental Concepts of Data Engineering
    Data engineering is a critical part of data science. Most of the time, they occur together in business applications. However, there are some fundamental differences between them that you should be aware of when working with large amounts of data. In this article, we will help you get a better idea about what data engineering… Read More »Top 5 Fundamental Concepts of Data Engineering The post Top 5 Fundamental Concepts of Data Engineering appeared first on Data Science Central.  ( 4 min )
    What’s Under the Tree for ECMAScript-mas 2022?
    Another year, another round of updates for ECMAScript, better known to the world as Javascript. This process of continuous improvement in the language has been ongoing since 2015, and is a lot like getting presents at Christmas from various relatives – sometimes it’s video games and shiny bicycles, other times its sweaters and fruitcake. This… Read More »What’s Under the Tree for ECMAScript-mas 2022? The post What’s Under the Tree for ECMAScript-mas 2022? appeared first on Data Science Central.  ( 6 min )
    Flutter vs Kotlin: Comparison of Mobile App Development Frameworks
    It is no secret that a mobile app is among the most powerful tools at the disposal of the market and that too across all sectors. They not only empower companies with the ability to reach out to and engage with their customers all over the world but also deliver a powerful boost to their… Read More »Flutter vs Kotlin: Comparison of Mobile App Development Frameworks The post Flutter vs Kotlin: Comparison of Mobile App Development Frameworks appeared first on Data Science Central.  ( 3 min )
    Fast-track The Way You Find Tech Talent with These IT Staffing Solutions
    Looking to fill open tech positions with quality hires quickly? Like many other IT leaders, you may be facing the uphill task of finding skilled candidates for various tech positions. A report on the impact of technology predicts that in 2022, filling tech positions will remain the key challenge for 73 percent of IT leaders.… Read More »Fast-track The Way You Find Tech Talent with These IT Staffing Solutions The post Fast-track The Way You Find Tech Talent with These IT Staffing Solutions appeared first on Data Science Central.  ( 4 min )
  • Open

    TIGGER: Scalable Generative Modelling for Temporal Interaction Graphs. (arXiv:2203.03564v2 [cs.LG] UPDATED)
    There has been a recent surge in learning generative models for graphs. While impressive progress has been made on static graphs, work on generative modeling of temporal graphs is at a nascent stage with significant scope for improvement. First, existing generative models do not scale with either the time horizon or the number of nodes. Second, existing techniques are transductive in nature and thus do not facilitate knowledge transfer. Finally, due to relying on one-to-one node mapping from source to the generated graph, existing models leak node identity information and do not allow up-scaling/down-scaling the source graph size. In this paper, we bridge these gaps with a novel generative model called TIGGER. TIGGER derives its power through a combination of temporal point processes with auto-regressive modeling enabling both transductive and inductive variants. Through extensive experiments on real datasets, we establish TIGGER generates graphs of superior fidelity, while also being up to 3 orders of magnitude faster than the state-of-the-art.  ( 2 min )
    Few-Shot Traffic Prediction with Graph Networks using Locale as Relational Inductive Biases. (arXiv:2203.03965v1 [cs.LG])
    Accurate short-term traffic prediction plays a pivotal role in various smart mobility operation and management systems. Currently, most of the state-of-the-art prediction models are based on graph neural networks (GNNs), and the required training samples are proportional to the size of the traffic network. In many cities, the available amount of traffic data is substantially below the minimum requirement due to the data collection expense. It is still an open question to develop traffic prediction models with a small size of training data on large-scale networks. We notice that the traffic states of a node for the near future only depend on the traffic states of its localized neighborhoods, which can be represented using the graph relational inductive biases. In view of this, this paper develops a graph network (GN)-based deep learning model LocaleGn that depicts the traffic dynamics using localized data aggregating and updating functions, as well as the node-wise recurrent neural networks. LocaleGn is a light-weighted model designed for training on few samples without over-fitting, and hence it can solve the problem of few-shot traffic prediction. The proposed model is examined on predicting both traffic speed and flow with six datasets, and the experimental results demonstrate that LocaleGn outperforms existing state-of-the-art baseline models. It is also demonstrated that the learned knowledge from LocaleGn can be transferred across cities. The research outcomes can help to develop light-weighted traffic prediction systems, especially for cities lacking in historically archived traffic data.  ( 2 min )
    Bayesian Optimisation-Assisted Neural Network Training Technique for Radio Localisation. (arXiv:2203.04032v1 [cs.LG])
    Radio signal-based (indoor) localisation technique is important for IoT applications such as smart factory and warehouse. Through machine learning, especially neural networks methods, more accurate mapping from signal features to target positions can be achieved. However, different radio protocols, such as WiFi, Bluetooth, etc., have different features in the transmitted signals that can be exploited for localisation purposes. Also, neural networks methods often rely on carefully configured models and extensive training processes to obtain satisfactory performance in individual localisation scenarios. The above poses a major challenge in the process of determining neural network model structure, or hyperparameters, as well as the selection of training features from the available data. This paper proposes a neural network model hyperparameter tuning and training method based on Bayesian optimisation. Adaptive selection of model hyperparameters and training features can be realised with minimal need for manual model training design. With the proposed technique, the training process is optimised in a more automatic and efficient way, enhancing the applicability of neural networks in localisation.  ( 2 min )
    Selective-Supervised Contrastive Learning with Noisy Labels. (arXiv:2203.04181v1 [cs.CV])
    Deep networks have strong capacities of embedding data into latent representations and finishing following tasks. However, the capacities largely come from high-quality annotated labels, which are expensive to collect. Noisy labels are more affordable, but result in corrupted representations, leading to poor generalization performance. To learn robust representations and handle noisy labels, we propose selective-supervised contrastive learning (Sel-CL) in this paper. Specifically, Sel-CL extend supervised contrastive learning (Sup-CL), which is powerful in representation learning, but is degraded when there are noisy labels. Sel-CL tackles the direct cause of the problem of Sup-CL. That is, as Sup-CL works in a \textit{pair-wise} manner, noisy pairs built by noisy labels mislead representation learning. To alleviate the issue, we select confident pairs out of noisy ones for Sup-CL without knowing noise rates. In the selection process, by measuring the agreement between learned representations and given labels, we first identify confident examples that are exploited to build confident pairs. Then, the representation similarity distribution in the built confident pairs is exploited to identify more confident pairs out of noisy pairs. All obtained confident pairs are finally used for Sup-CL to enhance representations. Experiments on multiple noisy datasets demonstrate the robustness of the learned representations by our method, following the state-of-the-art performance. Source codes are available at https://github.com/ShikunLi/Sel-CL  ( 2 min )
    Occupancy Flow Fields for Motion Forecasting in Autonomous Driving. (arXiv:2203.03875v1 [cs.RO])
    We propose Occupancy Flow Fields, a new representation for motion forecasting of multiple agents, an important task in autonomous driving. Our representation is a spatio-temporal grid with each grid cell containing both the probability of the cell being occupied by any agent, and a two-dimensional flow vector representing the direction and magnitude of the motion in that cell. Our method successfully mitigates shortcomings of the two most commonly-used representations for motion forecasting: trajectory sets and occupancy grids. Although occupancy grids efficiently represent the probabilistic location of many agents jointly, they do not capture agent motion and lose the agent identities. To this end, we propose a deep learning architecture that generates Occupancy Flow Fields with the help of a new flow trace loss that establishes consistency between the occupancy and flow predictions. We demonstrate the effectiveness of our approach using three metrics on occupancy prediction, motion estimation, and agent ID recovery. In addition, we introduce the problem of predicting speculative agents, which are currently-occluded agents that may appear in the future through dis-occlusion or by entering the field of view. We report experimental results on a large in-house autonomous driving dataset and the public INTERACTION dataset, and show that our model outperforms state-of-the-art models.  ( 2 min )
    Second-life Lithium-ion batteries: A chemistry-agnostic and scalable health estimation algorithm. (arXiv:2203.04249v1 [eess.SY])
    Battery state of health is an essential metric for diagnosing battery degradation during testing and operation. While many unique measurements are possible in the design phase, for practical applications often only temperature, voltage and current sensing are accessible. This paper presents a novel combination of machine learning techniques to produce accurate predictions significantly faster than standard Gaussian processes. The data-driven approach uses feature generation with simple mathematics, feature filtering, and bagging, which is validated with publicly available aging datasets of more than 200 cells with slow and fast charging, across different cathode chemistries, and for various operating conditions. Based on multiple training-test partitions, average and median state of health prediction root mean square error (RMSE) is found to be less than 1.48% and 1.27%, respectively, with a limited amount of input data, showing the capability of the approach even when input data and time are limiting factors. The process developed in this paper has direct applicability to today's incumbent open challenge of assessing retired batteries on the basis of their residual health, and therefore nominal remaining useful life, to allow fast classification for second-life reutilization.  ( 2 min )
    Manipulation of granular materials by learning particle interactions. (arXiv:2111.02274v2 [cs.RO] UPDATED)
    Manipulation of granular materials such as sand or rice remains an unsolved problem due to challenges such as the difficulty of defining their configuration or modeling the materials and their particles interactions. Current approaches tend to simplify the material dynamics and omit the interactions between the particles. In this paper, we propose to use a graph-based representation to model the interaction dynamics of the material and rigid bodies manipulating it. This allows the planning of manipulation trajectories to reach a desired configuration of the material. We use a graph neural network (GNN) to model the particle interactions via message-passing. To plan manipulation trajectories, we propose to minimise the Wasserstein distance between a predicted distribution of granular particles and their desired configuration. We demonstrate that the proposed method is able to pour granular materials into the desired configuration both in simulated and real scenarios.  ( 2 min )
    A general class of surrogate functions for stable and efficient reinforcement learning. (arXiv:2108.05828v4 [cs.LG] UPDATED)
    Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.  ( 2 min )
    Forecasting Characteristic 3D Poses of Human Actions. (arXiv:2011.15079v3 [cs.CV] UPDATED)
    We propose the task of forecasting characteristic 3d poses: from a short sequence observation of a person, predict a future 3d pose of that person in a likely action-defining, characteristic pose -- for instance, from observing a person picking up an apple, predict the pose of the person eating the apple. Prior work on human motion prediction estimates future poses at fixed time intervals. Although easy to define, this frame-by-frame formulation confounds temporal and intentional aspects of human action. Instead, we define a semantically meaningful pose prediction task that decouples the predicted pose from time, taking inspiration from goal-directed behavior. To predict characteristic poses, we propose a probabilistic approach that models the possible multi-modality in the distribution of likely characteristic poses. We then sample future pose hypotheses from the predicted distribution in an autoregressive fashion to model dependencies between joints. To evaluate our method, we construct a dataset of manually annotated characteristic 3d poses. Our experiments with this dataset suggest that our proposed probabilistic approach outperforms state-of-the-art methods by 26% on average.  ( 2 min )
    Learning Parameters for a Generalized Vidale-Wolfe Response Model with Flexible Ad Elasticity and Word-of-Mouth. (arXiv:2202.13566v1 [cs.AI] CROSS LISTED)
    In this research, we investigate a generalized form of Vidale-Wolfe (GVW) model. One key element of our modeling work is that the GVW model contains two useful indexes representing advertiser's elasticity and the word-of-mouth (WoM) effect, respectively. Moreover, we discuss some desirable properties of the GVW model, and present a deep neural network (DNN)-based estimation method to learn its parameters. Furthermore, based on three realworld datasets, we conduct computational experiments to validate the GVW model and identified properties. In addition, we also discuss potential advantages of the GVW model over econometric models. The research outcome shows that both the ad elasticity index and the WoM index have significant influences on advertising responses, and the GVW model has potential advantages over econometric models of advertising, in terms of several interesting phenomena drawn from practical advertising situations. The GVW model and its deep learning-based estimation method provide a basis to support big data-driven advertising analytics and decision makings; in the meanwhile, identified properties and experimental findings of this research illuminate critical managerial insights for advertisers in various advertising forms.  ( 2 min )
    Training a Bidirectional GAN-based One-Class Classifier for Network Intrusion Detection. (arXiv:2202.01332v2 [cs.LG] UPDATED)
    The network intrusion detection task is challenging because of the imbalanced and unlabeled nature of the dataset it operates on. Existing generative adversarial networks (GANs), are primarily used for creating synthetic samples from reals. They also have been proved successful in anomaly detection tasks. In our proposed method, we construct the trained encoder-discriminator as a one-class classifier based on Bidirectional GAN (Bi-GAN) for detecting anomalous traffic from normal traffic other than calculating expensive and complex anomaly scores or thresholds. Our experimental result illustrates that our proposed method is highly effective to be used in network intrusion detection tasks and outperforms other similar generative methods on the NSL-KDD dataset.  ( 2 min )
    Valid prediction intervals for regression problems. (arXiv:2107.00363v3 [stat.ML] UPDATED)
    Over the last few decades, various methods have been proposed for estimating prediction intervals in regression settings, including Bayesian methods, ensemble methods, direct interval estimation methods and conformal prediction methods. An important issue is the calibration of these methods: the generated prediction intervals should have a predefined coverage level, without being overly conservative. In this work, we review the above four classes of methods from a conceptual and experimental point of view. Results on benchmark data sets from various domains highlight large fluctuations in performance from one data set to another. These observations can be attributed to the violation of certain assumptions that are inherent to some classes of methods. We illustrate how conformal prediction can be used as a general calibration procedure for methods that deliver poor results without a calibration step.  ( 2 min )
    Provably Accurate and Scalable Linear Classifiers in Hyperbolic Spaces. (arXiv:2203.03730v1 [cs.LG])
    Many high-dimensional practical data sets have hierarchical structures induced by graphs or time series. Such data sets are hard to process in Euclidean spaces and one often seeks low-dimensional embeddings in other space forms to perform the required learning tasks. For hierarchical data, the space of choice is a hyperbolic space because it guarantees low-distortion embeddings for tree-like structures. The geometry of hyperbolic spaces has properties not encountered in Euclidean spaces that pose challenges when trying to rigorously analyze algorithmic solutions. We propose a unified framework for learning scalable and simple hyperbolic linear classifiers with provable performance guarantees. The gist of our approach is to focus on Poincar\'e ball models and formulate the classification problems using tangent space formalisms. Our results include a new hyperbolic perceptron algorithm as well as an efficient and highly accurate convex optimization setup for hyperbolic support vector machine classifiers. Furthermore, we adapt our approach to accommodate second-order perceptrons, where data is preprocessed based on second-order information (correlation) to accelerate convergence, and strategic perceptrons, where potentially manipulated data arrives in an online manner and decisions are made sequentially. The excellent performance of the Poincar\'e second-order and strategic perceptrons shows that the proposed framework can be extended to general machine learning problems in hyperbolic spaces. Our experimental results, pertaining to synthetic, single-cell RNA-seq expression measurements, CIFAR10, Fashion-MNIST and mini-ImageNet, establish that all algorithms provably converge and have complexity comparable to those of their Euclidean counterparts. Accompanying codes can be found at: https://github.com/thupchnsky/PoincareLinearClassification.  ( 2 min )
    Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks. (arXiv:2203.03929v1 [cs.LG])
    The wide adoption and application of Masked language models~(MLMs) on sensitive data (from legal to medical) necessitates a thorough quantitative investigation into their privacy vulnerabilities -- to what extent do MLMs leak information about their training data? Prior attempts at measuring leakage of MLMs via membership inference attacks have been inconclusive, implying the potential robustness of MLMs to privacy attacks. In this work, we posit that prior attempts were inconclusive because they based their attack solely on the MLM's model score. We devise a stronger membership inference attack based on likelihood ratio hypothesis testing that involves an additional reference MLM to more accurately quantify the privacy risks of memorization in MLMs. We show that masked language models are extremely susceptible to likelihood ratio membership inference attacks: Our empirical results, on models trained on medical notes, show that our attack improves the AUC of prior membership inference attacks from 0.66 to an alarmingly high 0.90 level, with a significant improvement in the low-error region: at 1% false positive rate, our attack is 51X more powerful than prior work.  ( 2 min )
    Graph Neural Networks for Image Classification and Reinforcement Learning using Graph representations. (arXiv:2203.03457v2 [cs.LG] UPDATED)
    In this paper, we will evaluate the performance of graph neural networks in two distinct domains: computer vision and reinforcement learning. In the computer vision section, we seek to learn whether a novel non-redundant representation for images as graphs can improve performance over trivial pixel to node mapping on a graph-level prediction graph, specifically image classification. For the reinforcement learning section, we seek to learn if explicitly modeling solving a Rubik's cube as a graph problem can improve performance over a standard model-free technique with no inductive bias.  ( 2 min )
    Optimum-statistical Collaboration Towards General and Efficient Black-box Optimization. (arXiv:2106.09215v3 [stat.ML] UPDATED)
    In this paper, we make the key delineation on the roles of resolution and statistical uncertainty in hierarchical bandits-based black-box optimization algorithms, guiding a more general analysis and a more efficient algorithm design. We introduce the \textit{optimum-statistical collaboration}, an algorithm framework of managing the interaction between optimization error flux and statistical error flux evolving in the optimization process. We provide a general analysis of this framework without specifying the forms of statistical error and uncertainty quantifier. Our framework and its analysis, due to their generality, can be applied to a large family of functions and partitions that satisfy different local smoothness assumptions and have different numbers of local optimums, which is much richer than the class of functions studied in prior works. Our framework also inspires us to propose a better measure of the statistical uncertainty and consequently a variance-adaptive algorithm \texttt{VHCT}. In theory, we prove the algorithm enjoys rate-optimal regret bounds under different local smoothness assumptions; in experiments, we show the algorithm outperforms prior efforts in different settings.  ( 2 min )
    Graph Reinforcement Learning for Predictive Power Allocation to Mobile Users. (arXiv:2203.03906v1 [cs.LG])
    Allocating resources with future channels can save resource to ensure quality-of-service of video streaming. In this paper, we optimize predictive power allocation to minimize the energy consumed at distributed units (DUs) by using deep deterministic policy gradient (DDPG) to find optimal policy and predict average channel gains. To improve training efficiency, we resort to graph DDPG for exploiting two kinds of relational priors: (a) permutation equivariant (PE) and permutation invariant (PI) properties of policy function and action-value function, (b) topology relation among users and DUs. To design graph DDPG framework more systematically in harnessing the priors, we first demonstrate how to transform matrix-based DDPG into graph-based DDPG. Then, we respectively design the actor and critic networks to satisfy the permutation properties when graph neural networks are used in embedding and end to-end manners. To avoid destroying the PE/PI properties of the actor and critic networks, we conceive a batch normalization method. Finally, we show the impact of leveraging each prior. Simulation results show that the learned predictive policy performs close to the optimal solution with perfect future information, and the graph DDPG algorithms converge much faster than existing DDPG algorithms.  ( 2 min )
    Learning to Remember Patterns: Pattern Matching Memory Networks for Traffic Forecasting. (arXiv:2110.10380v2 [cs.LG] UPDATED)
    Traffic forecasting is a challenging problem due to complex road networks and sudden speed changes caused by various events on roads. A number of models have been proposed to solve this challenging problem with a focus on learning spatio-temporal dependencies of roads. In this work, we propose a new perspective of converting the forecasting problem into a pattern matching task, assuming that large data can be represented by a set of patterns. To evaluate the validness of the new perspective, we design a novel traffic forecasting model, called Pattern-Matching Memory Networks (PM-MemNet), which learns to match input data to the representative patterns with a key-value memory structure. We first extract and cluster representative traffic patterns, which serve as keys in the memory. Then via matching the extracted keys and inputs, PM-MemNet acquires necessary information of existing traffic patterns from the memory and uses it for forecasting. To model spatio-temporal correlation of traffic, we proposed novel memory architecture GCMem, which integrates attention and graph convolution for memory enhancement. The experiment results indicate that PM-MemNet is more accurate than state-of-the-art models, such as Graph WaveNet with higher responsiveness. We also present a qualitative analysis result, describing how PM-MemNet works and achieves its higher accuracy when road speed rapidly changes.  ( 2 min )
    Flat minima generalize for low-rank matrix recovery. (arXiv:2203.03756v1 [cs.LG])
    Empirical evidence suggests that for a variety of overparameterized nonlinear models, most notably in neural network training, the growth of the loss around a minimizer strongly impacts its performance. Flat minima -- those around which the loss grows slowly -- appear to generalize well. This work takes a step towards understanding this phenomenon by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. We analyze overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single hidden layer neural networks with quadratic activation functions. In all cases, we show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. For matrix completion, we establish weak recovery, although empirical evidence suggests exact recovery holds here as well. We complete the paper with synthetic experiments that illustrate our findings.  ( 2 min )
    Open-Ended Knowledge Tracing. (arXiv:2203.03716v1 [cs.CY])
    Knowledge tracing refers to the problem of estimating each student's knowledge component/skill mastery level from their past responses to questions in educational applications. One direct benefit knowledge tracing methods provide is the ability to predict each student's performance on the future questions. However, one key limitation of most existing knowledge tracing methods is that they treat student responses to questions as binary-valued, i.e., whether the responses are correct or incorrect. Response correctness analysis/prediction is easy to navigate but loses important information, especially for open-ended questions: the exact student responses can potentially provide much more information about their knowledge states than only response correctness. In this paper, we present our first exploration into open-ended knowledge tracing, i.e., the analysis and prediction of students' open-ended responses to questions in the knowledge tracing setup. We first lay out a generic framework for open-ended knowledge tracing before detailing its application to the domain of computer science education with programming questions. We define a series of evaluation metrics in this domain and conduct a series of quantitative and qualitative experiments to test the boundaries of open-ended knowledge tracing methods on a real-world student code dataset.  ( 2 min )
    MASTAF: A Model-Agnostic Spatio-Temporal Attention Fusion Network for Few-shot Video Classification. (arXiv:2112.04585v2 [cs.CV] UPDATED)
    We propose MASTAF, a Model-Agnostic Spatio-Temporal Attention Fusion network for few-shot video classification. MASTAF takes input from a general video spatial and temporal representation,e.g., using 2D CNN, 3D CNN, and video Transformer. Then, to make the most of such representations, we use self- and cross-attention models to highlight the critical spatio-temporal region to increase the inter-class distance and decrease the intra-class distance. Last, MASTAF applies a lightweight fusion network and a nearest neighbor classifier to classify each query video. We demonstrate that MASTAF improves the state-of-the-art performance on three few-shot video classification benchmarks(UCF101, HMDB51, and Something-Something-V2), e.g., by up to 91.6%, 69.5%, and 60.7% for five-way one-shot video classification, respectively.  ( 2 min )
    Prune Your Model Before Distill It. (arXiv:2109.14960v2 [cs.LG] UPDATED)
    Knowledge distillation transfers the knowledge from a cumbersome teacher to a small student. Recent results suggest that the student-friendly teacher is more appropriate to distill since it provides more transferrable knowledge. In this work, we propose the novel framework, "prune, then distill," that prunes the model first to make it more transferrable and then distill it to the student. We provide several exploratory examples where the pruned teacher teaches better than the original unpruned networks. We further show theoretically that the pruned teacher plays the role of regularizer in distillation, which reduces the generalization error. Based on this result, we propose an end-to-end neural network compression scheme where the student network is formed based on the pruned teacher and then apply the "prune, then distill" strategy.  ( 2 min )
    Static Prediction of Runtime Errors by Learning to Execute Programs with External Resource Descriptions. (arXiv:2203.03771v1 [cs.LG])
    The execution behavior of a program often depends on external resources, such as program inputs or file contents, and so cannot be run in isolation. Nevertheless, software developers benefit from fast iteration loops where automated tools identify errors as early as possible, even before programs can be compiled and run. This presents an interesting machine learning challenge: can we predict runtime errors in a "static" setting, where program execution is not possible? Here, we introduce a real-world dataset and task for predicting runtime errors, which we show is difficult for generic models like Transformers. We approach this task by developing an interpreter-inspired architecture with an inductive bias towards mimicking program executions, which models exception handling and "learns to execute" descriptions of the contents of external resources. Surprisingly, we show that the model can also predict the location of the error, despite being trained only on labels indicating the presence/absence and kind of error. In total, we present a practical and difficult-yet-approachable challenge problem related to learning program execution and we demonstrate promising new capabilities of interpreter-inspired machine learning models for code.  ( 2 min )
    Leveraging Initial Hints for Free in Stochastic Linear Bandits. (arXiv:2203.04274v1 [cs.LG])
    We study the setting of optimizing with bandit feedback with additional prior knowledge provided to the learner in the form of an initial hint of the optimal action. We present a novel algorithm for stochastic linear bandits that uses this hint to improve its regret to $\tilde O(\sqrt{T})$ when the hint is accurate, while maintaining a minimax-optimal $\tilde O(d\sqrt{T})$ regret independent of the quality of the hint. Furthermore, we provide a Pareto frontier of tight tradeoffs between best-case and worst-case regret, with matching lower bounds. Perhaps surprisingly, our work shows that leveraging a hint shows provable gains without sacrificing worst-case performance, implying that our algorithm adapts to the quality of the hint for free. We also provide an extension of our algorithm to the case of $m$ initial hints, showing that we can achieve a $\tilde O(m^{2/3}\sqrt{T})$ regret.  ( 2 min )
    Exploring single-song autoencoding schemes for audio-based music structure analysis. (arXiv:2110.14437v2 [cs.SD] UPDATED)
    The ability of deep neural networks to learn complex data relations and representations is established nowadays, but it generally relies on large sets of training data. This work explores a "piece-specific" autoencoding scheme, in which a low-dimensional autoencoder is trained to learn a latent/compressed representation specific to a given song, which can then be used to infer the song structure. Such a model does not rely on supervision nor annotations, which are well-known to be tedious to collect and often ambiguous in Music Structure Analysis. We report that the proposed unsupervised auto-encoding scheme achieves the level of performance of supervised state-of-the-art methods with 3 seconds tolerance when using a Log Mel spectrogram representation on the RWC-Pop dataset.  ( 2 min )
    Improving Sequential Latent Variable Models with Autoregressive Flows. (arXiv:2010.03172v2 [cs.LG] UPDATED)
    We propose an approach for improving sequence modeling based on autoregressive normalizing flows. Each autoregressive transform, acting across time, serves as a moving frame of reference, removing temporal correlations, and simplifying the modeling of higher-level dynamics. This technique provides a simple, general-purpose method for improving sequence modeling, with connections to existing and classical techniques. We demonstrate the proposed approach both with standalone flow-based models and as a component within sequential latent variable models. Results are presented on three benchmark video datasets, where autoregressive flow-based dynamics improve log-likelihood performance over baseline models. Finally, we illustrate the decorrelation and improved generalization properties of using flow-based dynamics.  ( 2 min )
    SphereFace Revived: Unifying Hyperspherical Face Recognition. (arXiv:2109.05565v2 [cs.CV] UPDATED)
    This paper addresses the deep face recognition problem under an open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. To this end, hyperspherical face recognition, as a promising line of research, has attracted increasing attention and gradually become a major focus in face recognition research. As one of the earliest works in hyperspherical face recognition, SphereFace explicitly proposed to learn face embeddings with large inter-class angular margin. However, SphereFace still suffers from severe training instability which limits its application in practice. In order to address this problem, we introduce a unified framework to understand large angular margin in hyperspherical face recognition. Under this framework, we extend the study of SphereFace and propose an improved variant with substantially better training stability -- SphereFace-R. Specifically, we propose two novel ways to implement the multiplicative margin, and study SphereFace-R under three different feature normalization schemes (no feature normalization, hard feature normalization and soft feature normalization). We also propose an implementation strategy -- "characteristic gradient detachment" -- to stabilize training. Extensive experiments on SphereFace-R show that it is consistently better than or competitive with state-of-the-art methods.  ( 2 min )
    The role of alcohol outlet visits derived from mobile phone location data in enhancing domestic violence prediction at the neighborhood level. (arXiv:2203.04088v1 [cs.CY])
    Domestic violence (DV) is a serious public health issue, with 1 in 3 women and 1 in 4 men experiencing some form of partner-related violence every year. Existing research has shown a strong association between alcohol use and DV at the individual level. Accordingly, alcohol use could also be a predictor for DV at the neighborhood level, helping identify the neighborhoods where DV is more likely to happen. However, it is difficult and costly to collect data that can represent neighborhood-level alcohol use especially for a large geographic area. In this study, we propose to derive information about the alcohol outlet visits of the residents of different neighborhoods from anonymized mobile phone location data, and investigate whether the derived visits can help better predict DV at the neighborhood level. We use mobile phone data from the company SafeGraph, which is freely available to researchers and which contains information about how people visit various points-of-interest including alcohol outlets. In such data, a visit to an alcohol outlet is identified based on the GPS point location of the mobile phone and the building footprint (a polygon) of the alcohol outlet. We present our method for deriving neighborhood-level alcohol outlet visits, and experiment with four different statistical and machine learning models to investigate the role of the derived visits in enhancing DV prediction based on an empirical dataset about DV in Chicago. Our results reveal the effectiveness of the derived alcohol outlets visits in helping identify neighborhoods that are more likely to suffer from DV, and can inform policies related to DV intervention and alcohol outlet licensing.  ( 3 min )
    Nonconvex Matrix Completion with Linearly Parameterized Factors. (arXiv:2003.13153v2 [math.ST] UPDATED)
    Techniques of matrix completion aim to impute a large portion of missing entries in a data matrix through a small portion of observed ones. In practice including collaborative filtering, prior information and special structures are usually employed in order to improve the accuracy of matrix completion. In this paper, we propose a unified nonconvex optimization framework for matrix completion with linearly parameterized factors. In particular, by introducing a condition referred to as Correlated Parametric Factorization, we can conduct a unified geometric analysis for the nonconvex objective by establishing uniform upper bounds for low-rank estimation resulting from any local minimum. Perhaps surprisingly, the condition of Correlated Parametric Factorization holds for important examples including subspace-constrained matrix completion and skew-symmetric matrix completion. The effectiveness of our unified nonconvex optimization method is also empirically illustrated by extensive numerical simulations.
    Positional Encoder Graph Neural Networks for Geographic Data. (arXiv:2111.10144v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) provide a powerful and scalable solution for modeling continuous spatial data. However, in the absence of further context on the geometric structure of the data, they often rely on Euclidean distances to construct the input graphs. This assumption can be improbable in many real-world settings, where the spatial structure is more complex and explicitly non-Euclidean (e.g., road networks). In this paper, we propose PE-GNN, a new framework that incorporates spatial context and correlation explicitly into the models. Building on recent advances in geospatial auxiliary task learning and semantic spatial embeddings, our proposed method (1) learns a context-aware vector encoding of the geographic coordinates and (2) predicts spatial autocorrelation in the data in parallel with the main task. On spatial regression tasks, we show the effectiveness of our approach, improving performance over different state-of-the-art GNN approaches. We also test our approach for spatial interpolation, i.e., spatial regression without node features, a task that GNNs are currently not competitive at. We observe that our approach not only vastly improves over the GNN baselines, but can match Gaussian processes, the most commonly utilized method for spatial interpolation problems. The code for this study can be accessed via: https://github.com/konstantinklemmer/pe-gnn
    ADAVI: Automatic Dual Amortized Variational Inference Applied To Pyramidal Bayesian Models. (arXiv:2106.12248v3 [cs.LG] UPDATED)
    Frequently, population studies feature pyramidally-organized data represented using Hierarchical Bayesian Models (HBM) enriched with plates. These models can become prohibitively large in settings such as neuroimaging, where a sample is composed of a functional MRI signal measured on 300 brain locations, across 4 measurement sessions, and 30 subjects, resulting in around 1 million latent parameters.Such high dimensionality hampers the usage of modern, expressive flow-based techniques.To infer parameter posterior distributions in this challenging class of problems, we designed a novel methodology that automatically produces a variational family dual to a target HBM. This variational family, represented as a neural network, consists in the combination of an attention-based hierarchical encoder feeding summary statistics to a set of normalizing flows. Our automatically-derived neural network exploits exchangeability in the plate-enriched HBM and factorizes its parameter space. The resulting architecture reduces by orders of magnitude its parameterization with respect to that of a typical flow-based representation, while maintaining expressivity.Our method performs inference on the specified HBM in an amortized setup: once trained, it can readily be applied to a new data sample to compute the parameters' full posterior.We demonstrate the capability and scalability of our method on simulated data, as well as a challenging high-dimensional brain parcellation experiment. We also open up several questions that lie at the intersection between normalizing flows, SBI, structured Variational Inference, and inference amortization.
    Comparing representations of biological data learned with different AI paradigms, augmenting and cropping strategies. (arXiv:2203.04107v1 [eess.IV])
    Recent advances in computer vision and robotics enabled automated large-scale biological image analysis. Various machine learning approaches have been successfully applied to phenotypic profiling. However, it remains unclear how they compare in terms of biological feature extraction. In this study, we propose a simple CNN architecture and implement 4 different representation learning approaches. We train 16 deep learning setups on the 770k cancer cell images dataset under identical conditions, using different augmenting and cropping strategies. We compare the learned representations by evaluating multiple metrics for each of three downstream tasks: i) distance-based similarity analysis of known drugs, ii) classification of drugs versus controls, iii) clustering within cell lines. We also compare training times and memory usage. Among all tested setups, multi-crops and random augmentations generally improved performance across tasks, as expected. Strikingly, self-supervised (implicit contrastive learning) models showed competitive performance being up to 11 times faster to train. Self-supervised regularized learning required the most of memory and computation to deliver arguably the most informative features. We observe that no single combination of augmenting and cropping strategies consistently results in top performance across tasks and recommend prospective research directions.
    AdaPT: Fast Emulation of Approximate DNN Accelerators in PyTorch. (arXiv:2203.04071v1 [cs.LG])
    Current state-of-the-art employs approximate multipliers to address the highly increased power demands of DNN accelerators. However, evaluating the accuracy of approximate DNNs is cumbersome due to the lack of adequate support for approximate arithmetic in DNN frameworks. We address this inefficiency by presenting AdaPT, a fast emulation framework that extends PyTorch to support approximate inference as well as approximation-aware retraining. AdaPT can be seamlessly deployed and is compatible with the most DNNs. We evaluate the framework on several DNN models and application fields including CNNs, LSTMs, and GANs for a number of approximate multipliers with distinct bitwidth values. The results show substantial error recovery from approximate re-training and reduced inference time up to 53.9x with respect to the baseline approximate implementation.  ( 2 min )
    Variational Nearest Neighbor Gaussian Processes. (arXiv:2202.01694v2 [cs.LG] UPDATED)
    Variational approximations to Gaussian processes (GPs) typically use a small set of inducing points to form a low-rank approximation to the covariance matrix. In this work, we instead exploit a sparse approximation of the precision matrix. We propose variational nearest neighbor Gaussian process (VNNGP), which introduces a prior that only retains correlations within K nearest-neighboring observations, thereby inducing sparse precision structure. Using the variational framework, VNNGP's objective can be factorized over both observations and inducing points, enabling stochastic optimization with a time complexity of O($K^3$). Hence, we can arbitrarily scale the inducing point size, even to the point of putting inducing points at every observed location. We compare VNNGP to other scalable GPs through various experiments, and demonstrate that VNNGP (1) can dramatically outperform low-rank methods, and (2) is less prone to overfitting than other nearest neighbor methods.
    Covariate-Balancing-Aware Interpretable Deep Learning models for Treatment Effect Estimation. (arXiv:2203.03185v2 [stat.ML] UPDATED)
    Estimating treatment effects is of great importance for many biomedical applications with observational data. Particularly, interpretability of the treatment effects is preferable for many biomedical researchers. In this paper, we first give a theoretical analysis and propose an upper bound for the bias of average treatment effect estimation under the strong ignorability assumption. The proposed upper bound consists of two parts: training error for factual outcomes, and the distance between treated and control distributions. We use the Weighted Energy Distance (WED) to measure the distance between two distributions. Motivated by the theoretical analysis, we implement this upper bound as an objective function being minimized by leveraging a novel additive neural network architecture, which combines the expressivity of deep neural network, the interpretability of generalized additive model, the sufficiency of the balancing score for estimation adjustment, and covariate balancing properties of treated and control distributions, for estimating average treatment effects from observational data. Furthermore, we impose a so-called weighted regularization procedure based on non-parametric theory, to obtain some desirable asymptotic properties. The proposed method is illustrated by re-examining the benchmark datasets for causal inference, and it outperforms the state-of-art.
    DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos. (arXiv:2203.03996v1 [cs.CV])
    Convolutional neural network inference on video data requires powerful hardware for real-time processing. Given the inherent coherence across consecutive frames, large parts of a video typically change little. By skipping identical image regions and truncating insignificant pixel updates, computational redundancy can in theory be reduced significantly. However, these theoretical savings have been difficult to translate into practice, as sparse updates hamper computational consistency and memory access coherence; which are key for efficiency on real hardware. With DeltaCNN, we present a sparse convolutional neural network framework that enables sparse frame-by-frame updates to accelerate video inference in practice. We provide sparse implementations for all typical CNN layers and propagate sparse feature updates end-to-end - without accumulating errors over time. DeltaCNN is applicable to all convolutional neural networks without retraining. To the best of our knowledge, we are the first to significantly outperform the dense reference, cuDNN, in practical settings, achieving speedups of up to 7x with only marginal differences in accuracy.
    Estimating the average causal effect of intervention in continuous variables using machine learning. (arXiv:2203.03916v1 [stat.ML])
    The most widely discussed methods for estimating the Average Causal Effect / Average Treatment Effect are those for intervention in discrete binary variables whose value represents the intervention / non-intervention groups. On the other hand, methods for intervening in continuous variables independent of the data generating model has not been developed. In this study, we give a method for estimating the average causal effect for intervention in continuous variables that can be applied to data of any generating model as long as the causal effect is identifiable. The proposing method is independent of machine learning algorithms and preserves the identifiability of the data.
    GIFAIR-FL: A Framework for Group and Individual Fairness in Federated Learning. (arXiv:2108.02741v2 [cs.LG] UPDATED)
    In this paper we propose \texttt{GIFAIR-FL}: a framework that imposes \textbf{G}roup and \textbf{I}ndividual \textbf{FAIR}ness to \textbf{F}ederated \textbf{L}earning settings. By adding a regularization term, our algorithm penalizes the spread in the loss of client groups to drive the optimizer to fair solutions. Our framework \texttt{GIFAIR-FL} can accommodate both global and personalized settings. Theoretically, we show convergence in non-convex and strongly convex settings. Our convergence guarantees hold for both $i.i.d.$ and non-$i.i.d.$ data. To demonstrate the empirical performance of our algorithm, we apply our method to image classification and text prediction tasks. Compared to existing algorithms, our method shows improved fairness results while retaining superior or similar prediction accuracy.
    Neural Contextual Bandits via Reward-Biased Maximum Likelihood Estimation. (arXiv:2203.04192v1 [cs.LG])
    Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs. This paper studies the stochastic contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration. NeuralRBMLE leverages the representation power of neural networks and directly encodes exploratory behavior in the parameter space, without constructing confidence intervals of the estimated rewards. We propose two variants of NeuralRBMLE algorithms: The first variant directly obtains the RBMLE estimator by gradient ascent, and the second variant simplifies RBMLE to a simple index policy through an approximation. We show that both algorithms achieve $\widetilde{\mathcal{O}}(\sqrt{T})$ regret. Through extensive experiments, we demonstrate that the NeuralRBMLE algorithms achieve comparable or better empirical regrets than the state-of-the-art methods on real-world datasets with non-linear reward functions.
    Hypernetwork Dismantling via Deep Reinforcement Learning. (arXiv:2104.14332v2 [cs.LG] UPDATED)
    Network dismantling aims to degrade the connectivity of a network by removing an optimal set of nodes. It has been widely adopted in many real-world applications such as epidemic control and rumor containment. However, conventional methods usually focus on simple network modeling with only pairwise interactions, while group-wise interactions modeled by hypernetwork are ubiquitous and critical. In this work, we formulate the hypernetwork dismantling problem as a node sequence decision problem and propose a deep reinforcement learning (DRL)-based hypernetwork dismantling framework. Besides, we design a novel inductive hypernetwork embedding method to ensure the transferability to various real-world hypernetworks. Our framework first generates small-scale synthetic hypernetworks and embeds the nodes and hypernetworks into a low dimensional vector space to represent the action and state space in DRL, respectively. Then trial-and-error dismantling tasks are conducted by an agent on these synthetic hypernetworks, and the dismantling strategy is continuously optimized. Finally, the well-optimized strategy is applied to real-world hypernetwork dismantling tasks. Experimental results on five real-world hypernetworks demonstrate the effectiveness of our proposed framework.
    Measuring Fairness under Unawareness of Sensitive Attributes: A Quantification-Based Approach. (arXiv:2109.08549v2 [cs.CY] UPDATED)
    Algorithms and models are increasingly deployed to inform decisions about people, inevitably impacting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, i.e., ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. In order to reach this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it may be hard to measure the group fairness of trained models, even from within the companies developing them. In this work we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. More in detail, we adapt several quantification methods to measure demographic parity, and test them under five different experimental protocols corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.
    COLA: Consistent Learning with Opponent-Learning Awareness. (arXiv:2203.04098v1 [cs.LG])
    Learning in general-sum games can be unstable and often leads to socially undesirable, Pareto-dominated outcomes. To mitigate this, Learning with Opponent-Learning Awareness (LOLA) introduced opponent shaping to this setting, by accounting for the agent's influence on the anticipated learning steps of other agents. However, the original LOLA formulation (and follow-up work) is inconsistent because LOLA models other agents as naive learners rather than LOLA agents. In previous work, this inconsistency was suggested as a cause of LOLA's failure to preserve stable fixed points (SFPs). First, we formalize consistency and show that higher-order LOLA (HOLA) solves LOLA's inconsistency problem if it converges. Second, we correct a claim made in the literature, by proving that, contrary to Sch\"afer and Anandkumar (2019), Competitive Gradient Descent (CGD) does not recover HOLA as a series expansion. Hence, CGD also does not solve the consistency problem. Third, we propose a new method called Consistent LOLA (COLA), which learns update functions that are consistent under mutual opponent shaping. It requires no more than second-order derivatives and learns consistent update functions even when HOLA fails to converge. However, we also prove that even consistent update functions do not preserve SFPs, contradicting the hypothesis that this shortcoming is caused by LOLA's inconsistency. Finally, in an empirical evaluation on a set of general-sum games, we find that COLA finds prosocial solutions and that it converges under a wider range of learning rates than HOLA and LOLA. We support the latter finding with a theoretical result for a simple game.
    Variational methods for simulation-based inference. (arXiv:2203.04176v1 [stat.ML])
    We present Sequential Neural Variational Inference (SNVI), an approach to perform Bayesian inference in models with intractable likelihoods. SNVI combines likelihood-estimation (or likelihood-ratio-estimation) with variational inference to achieve a scalable simulation-based inference approach. SNVI maintains the flexibility of likelihood(-ratio) estimation to allow arbitrary proposals for simulations, while simultaneously providing a functional estimate of the posterior distribution without requiring MCMC sampling. We present several variants of SNVI and demonstrate that they are substantially more computationally efficient than previous algorithms, without loss of accuracy on benchmark tasks. We apply SNVI to a neuroscience model of the pyloric network in the crab and demonstrate that it can infer the posterior distribution with one order of magnitude fewer simulations than previously reported. SNVI vastly reduces the computational cost of simulation-based inference while maintaining accuracy and flexibility, making it possible to tackle problems that were previously inaccessible.  ( 2 min )
    Confidence intervals for the random forest generalization error. (arXiv:2112.06101v2 [stat.ML] UPDATED)
    We show that the byproducts of the standard training process of a random forest yield not only the well known and almost computationally free out-of-bag point estimate of the model generalization error, but also give a direct path to compute confidence intervals for the generalization error which avoids processes of data splitting and model retraining. Besides the low computational cost involved in their construction, these confidence intervals are shown through simulations to have good coverage and appropriate shrinking rate of their width in terms of the training sample size.
    Hierarchical Potential-based Reward Shaping from Task Specifications. (arXiv:2110.02792v2 [cs.LG] UPDATED)
    The automatic synthesis of autonomous agents' policies through reinforcement learning relies on the definition of a reward signal that simultaneously captures many, possibly conflicting, requirements of a task specification. In this work, we introduce a novel, hierarchical, potential-based reward-shaping approach (HPRS) for defining effective rewards for a large family of problems. We formalize a task as a set of target, safety, and comfort requirements, and define an automated methodology to shape the reward signal by considering a natural priority among them. Building on top of potential-based shaping, we show that HPRS does not alter the optimal policy. Our experimental evaluation demonstrates the superiority of HPRS in capturing the intended behavior, resulting in task-satisfying policies with improved comfort, and converging to optimal policies faster than other state-of-the-art approaches.
    Noisy Low-rank Matrix Optimization: Geometry of Local Minima and Convergence Rate. (arXiv:2203.03899v1 [math.OC])
    This paper is concerned with low-rank matrix optimization, which has found a wide range of applications in machine learning. This problem in the special case of matrix sense has been studied extensively through the notion of Restricted Isometry Property (RIP), leading to a wealth of results on the geometric landscape of the problem and the convergence rate of common algorithms. However, the existing results are not able to handle the problem with a general objective function subject to noisy data. In this paper, we address this problem by developing a mathematical framework that can deal with random corruptions to general objective functions, where the noise model is arbitrary. We prove that as long as the RIP constant of the noiseless objective is less than $1/3$, any spurious local solution of the noisy optimization problem must be close to the ground truth solution. By working through the strict saddle property, we also show that an approximate solution can be found in polynomial time. We characterize the geometry of the spurious local minima of the problem in a local region around the ground truth in the case when the RIP constant is greater than $1/3$. This paper offers the first set of results on the global and local optimization landscapes of general low-rank optimization problems under arbitrary random corruptions.
    Informative Planning for Worst-Case Error Minimisation in Sparse Gaussian Process Regression. (arXiv:2203.03828v1 [cs.RO])
    We present a planning framework for minimising the deterministic worst-case error in sparse Gaussian process (GP) regression. We first derive a universal worst-case error bound for sparse GP regression with bounded noise using interpolation theory on reproducing kernel Hilbert spaces (RKHSs). By exploiting the conditional independence (CI) assumption central to sparse GP regression, we show that the worst-case error minimisation can be achieved by solving a posterior entropy minimisation problem. In turn, the posterior entropy minimisation problem is solved using a Gaussian belief space planning algorithm. We corroborate the proposed worst-case error bound in a simple 1D example, and test the planning framework in simulation for a 2D vehicle in a complex flow field. Our results demonstrate that the proposed posterior entropy minimisation approach is effective in minimising deterministic error, and outperforms the conventional measurement entropy maximisation formulation when the inducing points are fixed.  ( 2 min )
    Conditional entropy minimization principle for learning domain invariant representation features. (arXiv:2201.10460v2 [cs.LG] UPDATED)
    Invariance principle-based methods, for example, Invariant Risk Minimization (IRM), have recently emerged as promising approaches for Domain Generalization (DG). Despite the promising theory, invariance principle-based approaches fail in common classification tasks due to the mixture of the true invariant features and the spurious invariant features. In this paper, we propose a framework based on the conditional entropy minimization principle to filter out the spurious invariant features leading to a new algorithm with a better generalization capability. We theoretically prove that under some particular assumptions, the representation function can precisely recover the true invariant features. In addition, we also show that the proposed approach is closely related to the well-known Information Bottleneck framework. Both the theoretical and numerical results are provided to justify our approach.
    Graph-based Reinforcement Learning meets Mixed Integer Programs: An application to 3D robot assembly discovery. (arXiv:2203.04120v1 [cs.RO])
    Robot assembly discovery is a challenging problem that lives at the intersection of resource allocation and motion planning. The goal is to combine a predefined set of objects to form something new while considering task execution with the robot-in-the-loop. In this work, we tackle the problem of building arbitrary, predefined target structures entirely from scratch using a set of Tetris-like building blocks and a robotic manipulator. Our novel hierarchical approach aims at efficiently decomposing the overall task into three feasible levels that benefit mutually from each other. On the high level, we run a classical mixed-integer program for global optimization of block-type selection and the blocks' final poses to recreate the desired shape. Its output is then exploited to efficiently guide the exploration of an underlying reinforcement learning (RL) policy. This RL policy draws its generalization properties from a flexible graph-based representation that is learned through Q-learning and can be refined with search. Moreover, it accounts for the necessary conditions of structural stability and robotic feasibility that cannot be effectively reflected in the previous layer. Lastly, a grasp and motion planner transforms the desired assembly commands into robot joint movements. We demonstrate the performance of the proposed method on a set of competitive simulated robot assembly discovery environments and report performance and robustness gains compared to an unstructured end-to-end approach. Videos are available at https://sites.google.com/view/rl-meets-milp .  ( 2 min )
    A Fast Scale-Invariant Algorithm for Non-negative Least Squares with Non-negative Data. (arXiv:2203.03808v1 [math.OC])
    Nonnegative (linear) least square problems are a fundamental class of problems that is well-studied in statistical learning and for which solvers have been implemented in many of the standard programming languages used within the machine learning community. The existing off-the-shelf solvers view the non-negativity constraint in these problems as an obstacle and, compared to unconstrained least squares, perform additional effort to address it. However, in many of the typical applications, the data itself is nonnegative as well, and we show that the nonnegativity in this case makes the problem easier. In particular, while the oracle complexity of unconstrained least squares problems necessarily scales with one of the data matrix constants (typically the spectral norm) and these problems are solved to additive error, we show that nonnegative least squares problems with nonnegative data are solvable to multiplicative error and with complexity that is independent of any matrix constants. The algorithm we introduce is accelerated and based on a primal-dual perspective. We further show how to provably obtain linear convergence using adaptive restart coupled with our method and demonstrate its effectiveness on large-scale data via numerical experiments.  ( 2 min )
    Towards Efficient Data-Centric Robust Machine Learning with Noise-based Augmentation. (arXiv:2203.03810v1 [cs.LG])
    The data-centric machine learning aims to find effective ways to build appropriate datasets which can improve the performance of AI models. In this paper, we mainly focus on designing an efficient data-centric scheme to improve robustness for models towards unforeseen malicious inputs in the black-box test settings. Specifically, we introduce a noised-based data augmentation method which is composed of Gaussian Noise, Salt-and-Pepper noise, and the PGD adversarial perturbations. The proposed method is built on lightweight algorithms and proved highly effective based on comprehensive evaluations, showing good efficiency on computation cost and robustness enhancement. In addition, we share our insights about the data-centric robust machine learning gained from our experiments.  ( 2 min )
    Learn to Match with No Regret: Reinforcement Learning in Markov Matching Markets. (arXiv:2203.03684v1 [cs.LG])
    We study a Markov matching market involving a planner and a set of strategic agents on the two sides of the market. At each step, the agents are presented with a dynamical context, where the contexts determine the utilities. The planner controls the transition of the contexts to maximize the cumulative social welfare, while the agents aim to find a myopic stable matching at each step. Such a setting captures a range of applications including ridesharing platforms. We formalize the problem by proposing a reinforcement learning framework that integrates optimistic value iteration with maximum weight matching. The proposed algorithm addresses the coupled challenges of sequential exploration, matching stability, and function approximation. We prove that the algorithm achieves sublinear regret.
    An Analysis of Measure-Valued Derivatives for Policy Gradients. (arXiv:2203.03917v1 [cs.LG])
    Reinforcement learning methods for robotics are increasingly successful due to the constant development of better policy gradient techniques. A precise (low variance) and accurate (low bias) gradient estimator is crucial to face increasingly complex tasks. Traditional policy gradient algorithms use the likelihood-ratio trick, which is known to produce unbiased but high variance estimates. More modern approaches exploit the reparametrization trick, which gives lower variance gradient estimates but requires differentiable value function approximators. In this work, we study a different type of stochastic gradient estimator - the Measure-Valued Derivative. This estimator is unbiased, has low variance, and can be used with differentiable and non-differentiable function approximators. We empirically evaluate this estimator in the actor-critic policy gradient setting and show that it can reach comparable performance with methods based on the likelihood-ratio or reparametrization tricks, both in low and high-dimensional action spaces. With this work, we want to show that the Measure-Valued Derivative estimator can be a useful alternative to other policy gradient estimators.  ( 2 min )
    Generating 3D Bio-Printable Patches Using Wound Segmentation and Reconstruction to Treat Diabetic Foot Ulcers. (arXiv:2203.03814v1 [eess.IV])
    We introduce AiD Regen, a novel system that generates 3D wound models combining 2D semantic segmentation with 3D reconstruction so that they can be printed via 3D bio-printers during the surgery to treat diabetic foot ulcers (DFUs). AiD Regen seamlessly binds the full pipeline, which includes RGB-D image capturing, semantic segmentation, boundary-guided point-cloud processing, 3D model reconstruction, and 3D printable G-code generation, into a single system that can be used out of the box. We developed a multi-stage data preprocessing method to handle small and unbalanced DFU image datasets. AiD Regen's human-in-the-loop machine learning interface enables clinicians to not only create 3D regenerative patches with just a few touch interactions but also customize and confirm wound boundaries. As evidenced by our experiments, our model outperforms prior wound segmentation models and our reconstruction algorithm is capable of generating 3D wound models with compelling accuracy. We further conducted a case study on a real DFU patient and demonstrated the effectiveness of AiD Regen in treating DFU wounds.
    Semi-Random Sparse Recovery in Nearly-Linear Time. (arXiv:2203.04002v1 [cs.DS])
    Sparse recovery is one of the most fundamental and well-studied inverse problems. Standard statistical formulations of the problem are provably solved by general convex programming techniques and more practical, fast (nearly-linear time) iterative methods. However, these latter "fast algorithms" have previously been observed to be brittle in various real-world settings. We investigate the brittleness of fast sparse recovery algorithms to generative model changes through the lens of studying their robustness to a "helpful" semi-random adversary, a framework which tests whether an algorithm overfits to input assumptions. We consider the following basic model: let $\mathbf{A} \in \mathbb{R}^{n \times d}$ be a measurement matrix which contains an unknown subset of rows $\mathbf{G} \in \mathbb{R}^{m \times d}$ which are bounded and satisfy the restricted isometry property (RIP), but is otherwise arbitrary. Letting $x^\star \in \mathbb{R}^d$ be $s$-sparse, and given either exact measurements $b = \mathbf{A} x^\star$ or noisy measurements $b = \mathbf{A} x^\star + \xi$, we design algorithms recovering $x^\star$ information-theoretically optimally in nearly-linear time. We extend our algorithm to hold for weaker generative models relaxing our planted RIP assumption to a natural weighted variant, and show that our method's guarantees naturally interpolate the quality of the measurement matrix to, in some parameter regimes, run in sublinear time. Our approach differs from prior fast iterative methods with provable guarantees under semi-random generative models: natural conditions on a submatrix which make sparse recovery tractable are NP-hard to verify. We design a new iterative method tailored to the geometry of sparse recovery which is provably robust to our semi-random model. We hope our approach opens the door to new robust, efficient algorithms for natural statistical inverse problems.
    On the Initialization for Convex-Concave Min-max Problems. (arXiv:2103.00284v2 [math.OC] UPDATED)
    Convex-concave min-max problems are ubiquitous in machine learning, and people usually utilize first-order methods (e.g., gradient descent ascent) to find the optimal solution. One feature which separates convex-concave min-max problems from convex minimization problems is that the best known convergence rates for min-max problems have an explicit dependence on the size of the domain, rather than on the distance between initial point and the optimal solution. This means that the convergence speed does not have any improvement even if the algorithm starts from the optimal solution, and hence, is oblivious to the initialization. Here, we show that strict-convexity-strict-concavity is sufficient to get the convergence rate to depend on the initialization. We also show how different algorithms can asymptotically achieve initialization-dependent convergence rates on this class of functions. Furthermore, we show that the so-called "parameter-free" algorithms allow to achieve improved initialization-dependent asymptotic rates without any learning rate to tune. In addition, we utilize this particular parameter-free algorithm as a subroutine to design a new algorithm, which achieves a novel non-asymptotic fast rate for strictly-convex-strictly-concave min-max problems with a growth condition and H{\"o}lder continuous solution mapping. Experiments are conducted to verify our theoretical findings and demonstrate the effectiveness of the proposed algorithms.
    Forward Looking Best-Response Multiplicative Weights Update Methods for Bilinear Zero-sum Games. (arXiv:2106.03579v3 [cs.GT] UPDATED)
    Our work focuses on extra gradient learning algorithms for finding Nash equilibria in bilinear zero-sum games. The proposed method, which can be formally considered as a variant of Optimistic Mirror Descent \cite{DBLP:conf/iclr/MertikopoulosLZ19}, uses a large learning rate for the intermediate gradient step which essentially leads to computing (approximate) best response strategies against the profile of the previous iteration. Although counter-intuitive at first sight due to the irrationally large, for an iterative algorithm, intermediate learning step, we prove that the method guarantees last-iterate convergence to an equilibrium. Particularly, we show that the algorithm reaches first an $\eta^{1/\rho}$-approximate Nash equilibrium, with $\rho > 1$, by decreasing the Kullback-Leibler divergence of each iterate by at least $\Omega(\eta^{1+\frac{1}{\rho}})$, for sufficiently small learning rate, $\eta$, until the method becomes a contracting map, and converges to the exact equilibrium. Furthermore, we perform experimental comparisons with the optimistic variant of the multiplicative weights update method, by \cite{Daskalakis2019LastIterateCZ} and show that our algorithm has significant practical potential since it offers substantial gains in terms of accelerated convergence.
    Robustly-reliable learners under poisoning attacks. (arXiv:2203.04160v1 [cs.LG])
    Data poisoning attacks, in which an adversary corrupts a training set with the goal of inducing specific desired mistakes, have raised substantial concern: even just the possibility of such an attack can make a user no longer trust the results of a learning system. In this work, we show how to achieve strong robustness guarantees in the face of such attacks across multiple axes. We provide robustly-reliable predictions, in which the predicted label is guaranteed to be correct so long as the adversary has not exceeded a given corruption budget, even in the presence of instance targeted attacks, where the adversary knows the test example in advance and aims to cause a specific failure on that example. Our guarantees are substantially stronger than those in prior approaches, which were only able to provide certificates that the prediction of the learning algorithm does not change, as opposed to certifying that the prediction is correct, as we are able to achieve in our work. Remarkably, we provide a complete characterization of learnability in this setting, in particular, nearly-tight matching upper and lower bounds on the region that can be certified, as well as efficient algorithms for computing this region given an ERM oracle. Moreover, for the case of linear separators over logconcave distributions, we provide efficient truly polynomial time algorithms (i.e., non-oracle algorithms) for such robustly-reliable predictions. We also extend these results to the active setting where the algorithm adaptively asks for labels of specific informative examples, and the difficulty is that the adversary might even be adaptive to this interaction, as well as to the agnostic learning setting where there is no perfect classifier even over the uncorrupted data.
    The Fundamental Price of Secure Aggregation in Differentially Private Federated Learning. (arXiv:2203.03761v1 [cs.LG])
    We consider the problem of training a $d$ dimensional model with distributed differential privacy (DP) where secure aggregation (SecAgg) is used to ensure that the server only sees the noisy sum of $n$ model updates in every training round. Taking into account the constraints imposed by SecAgg, we characterize the fundamental communication cost required to obtain the best accuracy achievable under $\varepsilon$ central DP (i.e. under a fully trusted server and no communication constraints). Our results show that $\tilde{O}\left( \min(n^2\varepsilon^2, d) \right)$ bits per client are both sufficient and necessary, and this fundamental limit can be achieved by a linear scheme based on sparse random projections. This provides a significant improvement relative to state-of-the-art SecAgg distributed DP schemes which use $\tilde{O}(d\log(d/\varepsilon^2))$ bits per client. Empirically, we evaluate our proposed scheme on real-world federated learning tasks. We find that our theoretical analysis is well matched in practice. In particular, we show that we can reduce the communication cost significantly to under $1.2$ bits per parameter in realistic privacy settings without decreasing test-time performance. Our work hence theoretically and empirically specifies the fundamental price of using SecAgg.
    Adversarial Attacks in Cooperative AI. (arXiv:2111.14833v3 [cs.LG] UPDATED)
    Single-agent reinforcement learning algorithms in a multi-agent environment are inadequate for fostering cooperation. If intelligent agents are to interact and work together to solve complex problems, methods that counter non-cooperative behavior are needed to facilitate the training of multiple agents. This is the goal of cooperative AI. Recent research in adversarial machine learning, however, shows that models (e.g., image classifiers) can be easily deceived into making inferior decisions. Meanwhile, an important line of research in cooperative AI has focused on introducing algorithmic improvements that accelerate learning of optimally cooperative behavior. We argue that prominent methods of cooperative AI are exposed to weaknesses analogous to those studied in prior machine learning research. More specifically, we show that three algorithms inspired by human-like social intelligence are, in principle, vulnerable to attacks that exploit weaknesses introduced by cooperative AI's algorithmic improvements and report experimental findings that illustrate how these vulnerabilities can be exploited in practice.
    Policy-Based Bayesian Experimental Design for Non-Differentiable Implicit Models. (arXiv:2203.04272v1 [cs.LG])
    For applications in healthcare, physics, energy, robotics, and many other fields, designing maximally informative experiments is valuable, particularly when experiments are expensive, time-consuming, or pose safety hazards. While existing approaches can sequentially design experiments based on prior observation history, many of these methods do not extend to implicit models, where simulation is possible but computing the likelihood is intractable. Furthermore, they often require either significant online computation during deployment or a differentiable simulation system. We introduce Reinforcement Learning for Deep Adaptive Design (RL-DAD), a method for simulation-based optimal experimental design for non-differentiable implicit models. RL-DAD extends prior work in policy-based Bayesian Optimal Experimental Design (BOED) by reformulating it as a Markov Decision Process with a reward function based on likelihood-free information lower bounds, which is used to learn a policy via deep reinforcement learning. The learned design policy maps prior histories to experiment designs offline and can be quickly deployed during online execution. We evaluate RL-DAD and find that it performs competitively with baselines on three benchmarks.
    Digital Speech Algorithms for Speaker De-Identification. (arXiv:2203.03932v1 [cs.SD])
    The present work is based on the COST Action IC1206 for De-identification in multimedia content. It was performed to test four algorithms of voice modifications on a speech gender recognizer to find the degree of modification of pitch when the speech recognizer have the probability of success equal to the probability of failure. The purpose of this analysis is to assess the intensity of the speech tone modification, the quality, the reversibility and not-reversibility of the changes made.  ( 2 min )
    A New Era: Intelligent Tutoring Systems Will Transform Online Learning for Millions. (arXiv:2203.03724v1 [cs.CY])
    Despite artificial intelligence (AI) having transformed major aspects of our society, less than a fraction of its potential has been explored, let alone deployed, for education. AI-powered learning can provide millions of learners with a highly personalized, active and practical learning experience, which is key to successful learning. This is especially relevant in the context of online learning platforms. In this paper, we present the results of a comparative head-to-head study on learning outcomes for two popular online learning platforms (n=199 participants): A MOOC platform following a traditional model delivering content using lecture videos and multiple-choice quizzes, and the Korbit learning platform providing a highly personalized, active and practical learning experience. We observe a huge and statistically significant increase in the learning outcomes, with students on the Korbit platform providing full feedback resulting in higher course completion rates and achieving learning gains 2 to 2.5 times higher than both students on the MOOC platform and students in a control group who don't receive personalized feedback on the Korbit platform. The results demonstrate the tremendous impact that can be achieved with a personalized, active learning AI-powered system. Making this technology and learning experience available to millions of learners around the world will represent a significant leap forward towards the democratization of education.
    A Review of the Gumbel-max Trick and its Extensions for Discrete Stochasticity in Machine Learning. (arXiv:2110.01515v2 [cs.LG] UPDATED)
    The Gumbel-max trick is a method to draw a sample from a categorical distribution, given by its unnormalized (log-)probabilities. Over the past years, the machine learning community has proposed several extensions of this trick to facilitate, e.g., drawing multiple samples, sampling from structured domains, or gradient estimation for error backpropagation in neural network optimization. The goal of this survey article is to present background about the Gumbel-max trick, and to provide a structured overview of its extensions to ease algorithm selection. Moreover, it presents a comprehensive outline of (machine learning) literature in which Gumbel-based algorithms have been leveraged, reviews commonly-made design choices, and sketches a future perspective.  ( 2 min )
    How Well Do Sparse Imagenet Models Transfer?. (arXiv:2111.13445v4 [cs.CV] UPDATED)
    Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream," specialized datasets. Generally, it is understood that more accurate models on the "upstream" dataset will provide better transfer accuracy "downstream". In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset, which have been pruned - that is, compressed by sparsifiying their connections. Specifically, we consider transfer using unstructured pruned models obtained by applying several state-of-the-art pruning methods, including magnitude-based, second-order, re-growth and regularization approaches, in the context of twelve standard transfer tasks. In a nutshell, our study shows that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities, and, while doing so, can lead to significant inference and even training speedups. At the same time, we observe and analyze significant differences in the behaviour of different pruning methods.  ( 2 min )
    Cognitive Diagnosis with Explicit Student Vector Estimation and Unsupervised Question Matrix Learning. (arXiv:2203.03722v1 [cs.CY])
    Cognitive diagnosis is an essential task in many educational applications. Many solutions have been designed in the literature. The deterministic input, noisy "and" gate (DINA) model is a classical cognitive diagnosis model and can provide interpretable cognitive parameters, e.g., student vectors. However, the assumption of the probabilistic part of DINA is too strong, because it assumes that the slip and guess rates of questions are student-independent. Besides, the question matrix (i.e., Q-matrix) recording the skill distribution of the questions in the cognitive diagnosis domain often requires precise labels given by domain experts. Thus, we propose an explicit student vector estimation (ESVE) method to estimate the student vectors of DINA with a local self-consistent test, which does not rely on any assumptions for the probabilistic part of DINA. Then, based on the estimated student vectors, the probabilistic part of DINA can be modified to a student dependent model that the slip and guess rates are related to student vectors. Furthermore, we propose an unsupervised method called heuristic bidirectional calibration algorithm (HBCA) to label the Q-matrix automatically, which connects the question difficulty relation and the answer results for initialization and uses the fault tolerance of ESVE-DINA for calibration. The experimental results on two real-world datasets show that ESVE-DINA outperforms the DINA model on accuracy and that the Q-matrix labeled automatically by HBCA can achieve performance comparable to that obtained with the manually labeled Q-matrix when using the same model structure.  ( 2 min )
    Data augmentation with mixtures of max-entropy transformations for filling-level classification. (arXiv:2203.04027v1 [cs.LG])
    We address the problem of distribution shifts in test-time data with a principled data augmentation scheme for the task of content-level classification. In such a task, properties such as shape or transparency of test-time containers (cup or drinking glass) may differ from those represented in the training data. Dealing with such distribution shifts using standard augmentation schemes is challenging and transforming the training images to cover the properties of the test-time instances requires sophisticated image manipulations. We therefore generate diverse augmentations using a family of max-entropy transformations that create samples with new shapes, colors and spectral characteristics. We show that such a principled augmentation scheme, alone, can replace current approaches that use transfer learning or can be used in combination with transfer learning to improve its performance.  ( 2 min )
    Seeing BDD100K in dark: Single-Stage Night-time Object Detection via Continual Fourier Contrastive Learning. (arXiv:2112.02891v2 [cs.CV] UPDATED)
    In this paper, we study the lesser explored avenue of object detection at night-time. An object detector trained on abundant labeled daytime images often fails to perform well on night images, due to domain gap. As collecting more labeled data from night-time is expensive, unpaired generative image translation techniques seek to synthesize night-time images. However, unrealistic artifacts often arise on the synthetic images. Illuminating night-time inference images also does not work well in practice, as shown in our paper. To address these issues, we suggest a novel technique for enhancing the object detector via Contrastive Learning, which tries to group together embeddings of similar images. To provide anchor-positive image pairs for Contrastive Learning, we leverage Fourier Transformation, which is naturally good at preserving the semantics of an image. For practical benefits in real-time applications, we choose the recently proposed YOLOF single-stage detector, which provides a simple and clean encoder-decoder segregation of the detector network. However, merely trying to teach the encoder to perform well on the auxiliary Contrastive Learning task may lead to catastrophic forgetting of the knowledge essential for object detection. Hence, we train the encoder in a Continual Learning fashion. Our novel method by an elegant training framework achieves state-of-the-art performance on the large scale BDD100K dataset, in an uniform setting, chosen, to the best of our knowledge, for the first time.  ( 2 min )
    Biometric recognition: why not massively adopted yet?. (arXiv:2203.03719v1 [cs.CY])
    Although there has been a dramatically reduction on the prices of capturing devices and an increase on computing power in the last decade, it seems that biometric systems are still far from massive adoption for civilian applications. This paper deals with the causes of this phenomenon, as well as some misconceptions regarding biometric identification.  ( 2 min )
    Unsupervised Speech Enhancement using Dynamical Variational Auto-Encoders. (arXiv:2106.12271v2 [cs.SD] UPDATED)
    Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to model time series of high-dimensional data. DVAEs can be considered as extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the interest of using DVAEs over the VAE for speech spectrograms modeling. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, but only requires clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both speech signals unsupervised representation learning and dynamics modeling. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.  ( 2 min )
    Nonnegative Tucker Decomposition with Beta-divergence for Music Structure Analysis of audio signals. (arXiv:2110.14434v3 [cs.SD] UPDATED)
    Nonnegative Tucker Decomposition (NTD), a tensor decomposition model, has received increased interest in the recent years because of its ability to blindly extract meaningful patterns in tensor data. Nevertheless, existing algorithms to compute NTD are mostly designed for the Euclidean loss. On the other hand, NTD has recently proven to be a powerful tool in Music Information Retrieval. This work proposes a Multiplicative Updates algorithm to compute NTD with the beta-divergence loss, often considered a better loss for audio processing. We notably show how to implement efficiently the multiplicative rules using tensor algebra, a naive approach being intractable. Finally, we show on a Music Structure Analysis task that unsupervised NTD fitted with beta-divergence loss outperforms earlier results obtained with the Euclidean loss.  ( 2 min )
    Constrained Learning with Non-Convex Losses. (arXiv:2103.05134v3 [cs.LG] UPDATED)
    Though learning has become a core component of modern information processing, there is now ample evidence that it can lead to biased, unsafe, and prejudiced systems. The need to impose requirements on learning is therefore paramount, especially as it reaches critical applications in social, industrial, and medical domains. However, the non-convexity of most modern statistical problems is only exacerbated by the introduction of constraints. Whereas good unconstrained solutions can often be learned using empirical risk minimization, even obtaining a model that satisfies statistical constraints can be challenging. All the more so, a good one. In this paper, we overcome this issue by learning in the empirical dual domain, where constrained statistical learning problems become unconstrained and deterministic. We analyze the generalization properties of this approach by bounding the empirical duality gap -- i.e., the difference between our approximate, tractable solution and the solution of the original (non-convex) statistical problem -- and provide a practical constrained learning algorithm. These results establish a constrained counterpart to classical learning theory, enabling the explicit use of constraints in learning. We illustrate this theory and algorithm in rate-constrained learning applications arising in fairness and adversarial robustness.  ( 2 min )
    AgraSSt: Approximate Graph Stein Statistics for Interpretable Assessment of Implicit Graph Generators. (arXiv:2203.03673v1 [stat.ML])
    We propose and analyse a novel statistical procedure, coined AgraSSt, to assess the quality of graph generators that may not be available in explicit form. In particular, AgraSSt can be used to determine whether a learnt graph generating process is capable of generating graphs that resemble a given input graph. Inspired by Stein operators for random graphs, the key idea of AgraSSt is the construction of a kernel discrepancy based on an operator obtained from the graph generator. AgraSSt can provide interpretable criticisms for a graph generator training procedure and help identify reliable sample batches for downstream tasks. Using Stein`s method we give theoretical guarantees for a broad class of random graph models. We provide empirical results on both synthetic input graphs with known graph generation procedures, and real-world input graphs that the state-of-the-art (deep) generative models for graphs are trained on.  ( 2 min )
    Explaining Classifiers by Constructing Familiar Concepts. (arXiv:2203.04109v1 [cs.CV])
    Interpreting a large number of neurons in deep learning is difficult. Our proposed `CLAssifier-DECoder' architecture (ClaDec) facilitates the understanding of the output of an arbitrary layer of neurons or subsets thereof. It uses a decoder that transforms the incomprehensible representation of the given neurons to a representation that is more similar to the domain a human is familiar with. In an image recognition problem, one can recognize what information (or concepts) a layer maintains by contrasting reconstructed images of ClaDec with those of a conventional auto-encoder(AE) serving as reference. An extension of ClaDec allows trading comprehensibility and fidelity. We evaluate our approach for image classification using convolutional neural networks. We show that reconstructed visualizations using encodings from a classifier capture more relevant classification information than conventional AEs. This holds although AEs contain more information on the original input. Our user study highlights that even non-experts can identify a diverse set of concepts contained in images that are relevant (or irrelevant) for the classifier. We also compare against saliency based methods that focus on pixel relevance rather than concepts. We show that ClaDec tends to highlight more relevant input areas to classification though outcomes depend on classifier architecture. Code is at \url{https://github.com/JohnTailor/ClaDec}  ( 2 min )
    Taxonomy of Machine Learning Safety: A Survey and Primer. (arXiv:2106.04823v2 [cs.LG] UPDATED)
    The open-world deployment of Machine Learning (ML) algorithms in safety-critical applications such as autonomous vehicles needs to address a variety of ML vulnerabilities such as interpretability, verifiability, and performance limitations. Research explores different approaches to improve ML dependability by proposing new models and training techniques to reduce generalization error, achieve domain adaptation, and detect outlier examples and adversarial attacks. However, there is a missing connection between ongoing ML research and well-established safety principles. In this paper, we present a structured and comprehensive review of ML techniques to improve the dependability of ML algorithms in uncontrolled open-world settings. From this review, we propose the Taxonomy of ML Safety that maps state-of-the-art ML techniques to key engineering safety strategies. Our taxonomy of ML safety presents a safety-oriented categorization of ML techniques to provide guidance for improving dependability of the ML design and development. The proposed taxonomy can serve as a safety checklist to aid designers in improving coverage and diversity of safety strategies employed in any given ML system.  ( 2 min )
    Accounting for uncertainty of non-linear regression models by divisive data resorting. (arXiv:2104.01714v4 [cs.LG] UPDATED)
    This paper focuses on building models of stochastic systems with aleatoric uncertainty. The nature of the considered systems is such that the identical inputs can result in different outputs, i.e. the output is a random variable. This paper suggests a novel algorithm of boosted ensemble training of multiple models for obtaining a probability distribution of an individual output as a function of a system input. The efficiency of the suggested algorithm is demonstrated by comparing it with the Monte-Carlo simulations using the Kolmogorov-Arnold representation as a regression model. This choice of the model is a simple preference of the authors, and the suggested algorithm of ensemble training does not assume any special properties of the model, other than having descriptive capabilities with some expected accuracy for the chosen dataset type.  ( 2 min )
    Robust Multi-Task Learning and Online Refinement for Spacecraft Pose Estimation across Domain Gap. (arXiv:2203.04275v1 [cs.CV])
    This work presents Spacecraft Pose Network v2 (SPNv2), a Convolutional Neural Network (CNN) for pose estimation of noncooperative spacecraft across domain gap. SPNv2 is a multi-scale, multi-task CNN which consists of a shared multi-scale feature encoder and multiple prediction heads that perform different tasks on a shared feature output. These tasks are all related to detection and pose estimation of a target spacecraft from an image, such as prediction of pre-defined satellite keypoints, direct pose regression, and binary segmentation of the satellite foreground. It is shown that by jointly training on different yet related tasks with extensive data augmentations on synthetic images only, the shared encoder learns features that are common across image domains that have fundamentally different visual characteristics compared to synthetic images. This work also introduces Online Domain Refinement (ODR) which refines the parameters of the normalization layers of SPNv2 on the target domain images online at deployment. Specifically, ODR performs self-supervised entropy minimization of the predicted satellite foreground, thereby improving the CNN's performance on the target domain images without their pose labels and with minimal computational efforts. The GitHub repository for SPNv2 will be made available in the near future.  ( 2 min )
    Plumeria at SemEval-2022 Task 6: Robust Approaches for Sarcasm Detection for English and Arabic Using Transformers and Data Augmentation. (arXiv:2203.04111v1 [cs.CL])
    This paper describes our submission to SemEval-2022 Task 6 on sarcasm detection and its five subtasks for English and Arabic. Sarcasm conveys a meaning which contradicts the literal meaning, and it is mainly found on social networks. It has a significant role in understanding the intention of the user. For detecting sarcasm, we used deep learning techniques based on transformers due to its success in the field of Natural Language Processing (NLP) without the need for feature engineering. The datasets were taken from tweets. We created new datasets by augmenting with external data or by using word embeddings and repetition of instances. Experiments were done on the datasets with different types of preprocessing because it is crucial in this task. The rank of our team was consistent across four subtasks (fourth rank in three subtasks and sixth rank in one subtask); whereas other teams might be in the top ranks for some subtasks but rank drastically less in other subtasks. This implies the robustness and stability of the models and the techniques we used.  ( 2 min )
    Secure PAC Bayesian Regression via Real Shamir Secret Sharing. (arXiv:2109.11200v2 [cs.LG] UPDATED)
    A common approach of system identification and machine learning is to generate a model by using training data to predict the test data instances as accurate as possible. Nonetheless, concerns about data privacy are increasingly raised, but not always addressed. We present a secure protocol for learning a linear model relying on recently described technique called real number secret sharing. We take as our starting point the PAC Bayesian bounds and deduce a closed form for the model parameters which depends on the data and the prior from the PAC Bayesian bounds. To obtain the model parameters one needs to solve a linear system. However, we consider the situation where several parties hold different data instances and they are not willing to give up the privacy of the data. Hence, we suggest to use real number secret sharing and multiparty computation to share the data and solve the linear regression in a secure way without violating the privacy of data. We suggest two methods; a secure inverse method and a secure Gaussian elimination method, and compare these methods at the end. The benefit of using secret sharing directly on real numbers is reflected in the simplicity of the protocols and the number of rounds needed. However, this comes with the drawback that a share might leak a small amount of information, but in our analysis we argue that the leakage is small.  ( 2 min )
    Learning Bidirectional Translation between Descriptions and Actions with Small Paired Data. (arXiv:2203.04218v1 [cs.RO])
    This study achieved bidirectional translation between descriptions and actions using small paired data. The ability to mutually generate descriptions and actions is essential for robots to collaborate with humans in their daily lives. The robot is required to associate real-world objects with linguistic expressions, and large-scale paired data are required for machine learning approaches. However, a paired dataset is expensive to construct and difficult to collect. This study proposes a two-stage training method for bidirectional translation. In the proposed method, we train recurrent autoencoders (RAEs) for descriptions and actions with a large amount of non-paired data. Then, we fine-tune the entire model to bind their intermediate representations using small paired data. Because the data used for pre-training do not require pairing, behavior-only data or a large language corpus can be used. We experimentally evaluated our method using a paired dataset consisting of motion-captured actions and descriptions. The results showed that our method performed well, even when the amount of paired data to train was small. The visualization of the intermediate representations of each RAE showed that similar actions were encoded in a clustered position and the corresponding feature vectors well aligned.  ( 2 min )
    Tight bounds for minimum l1-norm interpolation of noisy data. (arXiv:2111.05987v2 [math.ST] UPDATED)
    We provide matching upper and lower bounds of order $\sigma^2/\log(d/n)$ for the prediction error of the minimum $\ell_1$-norm interpolator, a.k.a. basis pursuit. Our result is tight up to negligible terms when $d \gg n$, and is the first to imply asymptotic consistency of noisy minimum-norm interpolation for isotropic features and sparse ground truths. Our work complements the literature on "benign overfitting" for minimum $\ell_2$-norm interpolation, where asymptotic consistency can be achieved only when the features are effectively low-dimensional.  ( 2 min )
    New Insights on Reducing Abrupt Representation Change in Online Continual Learning. (arXiv:2203.03798v1 [cs.LG])
    In the online continual learning paradigm, agents must learn from a changing distribution while respecting memory and compute constraints. Experience Replay (ER), where a small subset of past data is stored and replayed alongside new data, has emerged as a simple and effective learning strategy. In this work, we focus on the change in representations of observed data that arises when previously unobserved classes appear in the incoming data stream, and new classes must be distinguished from previous ones. We shed new light on this question by showing that applying ER causes the newly added classes' representations to overlap significantly with the previous classes, leading to highly disruptive parameter updates. Based on this empirical analysis, we propose a new method which mitigates this issue by shielding the learned representations from drastic adaptation to accommodate new classes. We show that using an asymmetric update rule pushes new classes to adapt to the older ones (rather than the reverse), which is more effective especially at task boundaries, where much of the forgetting typically occurs. Empirical results show significant gains over strong baselines on standard continual learning benchmarks  ( 2 min )
    Distributed Learning of Deep Neural Networks using Independent Subnet Training. (arXiv:1910.02120v6 [cs.LG] UPDATED)
    Distributed machine learning (ML) can bring more computational resources to bear than single-machine learning, thus enabling reductions in training time. Distributed learning partitions models and data over many machines, allowing model and dataset sizes beyond the available compute power and memory of a single machine. In practice though, distributed ML is challenging when distribution is mandatory, rather than chosen by the practitioner. In such scenarios, data could unavoidably be separated among workers due to limited memory capacity per worker or even because of data privacy issues. There, existing distributed methods will utterly fail due to dominant transfer costs across workers, or do not even apply. We propose a new approach to distributed fully connected neural network learning, called independent subnet training (IST), to handle these cases. In IST, the original network is decomposed into a set of narrow subnetworks with the same depth. These subnetworks are then trained locally before parameters are exchanged to produce new subnets and the training cycle repeats. Such a naturally "model parallel" approach limits memory usage by storing only a portion of network parameters on each device. Additionally, no requirements exist for sharing data between workers (i.e., subnet training is local and independent) and communication volume and frequency are reduced by decomposing the original network into independent subnets. These properties of IST can cope with issues due to distributed data, slow interconnects, or limited device memory, making IST a suitable approach for cases of mandatory distribution. We show experimentally that IST results in training times that are much lower than common distributed learning approaches.  ( 3 min )
    Contrastive Conditional Neural Processes. (arXiv:2203.03978v1 [cs.LG])
    Conditional Neural Processes~(CNPs) bridge neural networks with probabilistic inference to approximate functions of Stochastic Processes under meta-learning settings. Given a batch of non-{\it i.i.d} function instantiations, CNPs are jointly optimized for in-instantiation observation prediction and cross-instantiation meta-representation adaptation within a generative reconstruction pipeline. There can be a challenge in tying together such two targets when the distribution of function observations scales to high-dimensional and noisy spaces. Instead, noise contrastive estimation might be able to provide more robust representations by learning distributional matching objectives to combat such inherent limitation of generative models. In light of this, we propose to equip CNPs by 1) aligning prediction with encoded ground-truth observation, and 2) decoupling meta-representation adaptation from generative reconstruction. Specifically, two auxiliary contrastive branches are set up hierarchically, namely in-instantiation temporal contrastive learning~({\tt TCL}) and cross-instantiation function contrastive learning~({\tt FCL}), to facilitate local predictive alignment and global function consistency, respectively. We empirically show that {\tt TCL} captures high-level abstraction of observations, whereas {\tt FCL} helps identify underlying functions, which in turn provides more efficient representations. Our model outperforms other CNPs variants when evaluating function distribution reconstruction and parameter identification across 1D, 2D and high-dimensional time-series.  ( 2 min )
    End-to-end Multiple Instance Learning with Gradient Accumulation. (arXiv:2203.03981v1 [cs.CV])
    Being able to learn on weakly labeled data, and provide interpretability, are two of the main reasons why attention-based deep multiple instance learning (ABMIL) methods have become particularly popular for classification of histopathological images. Such image data usually come in the form of gigapixel-sized whole-slide-images (WSI) that are cropped into smaller patches (instances). However, the sheer size of the data makes training of ABMIL models challenging. All the instances from one WSI cannot be processed at once by conventional GPUs. Existing solutions compromise training by relying on pre-trained models, strategic sampling or selection of instances, or self-supervised learning. We propose a training strategy based on gradient accumulation that enables direct end-to-end training of ABMIL models without being limited by GPU memory. We conduct experiments on both QMNIST and Imagenette to investigate the performance and training time, and compare with the conventional memory-expensive baseline and a recent sampled-based approach. This memory-efficient approach, although slower, reaches performance indistinguishable from the memory-expensive baseline.  ( 2 min )
    YONO: Modeling Multiple Heterogeneous Neural Networks on Microcontrollers. (arXiv:2203.03794v1 [cs.LG])
    With the advancement of Deep Neural Networks (DNN) and large amounts of sensor data from Internet of Things (IoT) systems, the research community has worked to reduce the computational and resource demands of DNN to compute on low-resourced microcontrollers (MCUs). However, most of the current work in embedded deep learning focuses on solving a single task efficiently, while the multi-tasking nature and applications of IoT devices demand systems that can handle a diverse range of tasks (activity, voice, and context recognition) with input from a variety of sensors, simultaneously. In this paper, we propose YONO, a product quantization (PQ) based approach that compresses multiple heterogeneous models and enables in-memory model execution and switching for dissimilar multi-task learning on MCUs. We first adopt PQ to learn codebooks that store weights of different models. Also, we propose a novel network optimization and heuristics to maximize the compression rate and minimize the accuracy loss. Then, we develop an online component of YONO for efficient model execution and switching between multiple tasks on an MCU at run time without relying on an external storage device. YONO shows remarkable performance as it can compress multiple heterogeneous models with negligible or no loss of accuracy up to 12.37$\times$. Besides, YONO's online component enables an efficient execution (latency of 16-159 ms per operation) and reduces model loading/switching latency and energy consumption by 93.3-94.5% and 93.9-95.0%, respectively, compared to external storage access. Interestingly, YONO can compress various architectures trained with datasets that were not shown during YONO's offline codebook learning phase showing the generalizability of our method. To summarize, YONO shows great potential and opens further doors to enable multi-task learning systems on extremely resource-constrained devices.  ( 2 min )
    Zero-delay Consistent and Smooth Trainable Interpolation. (arXiv:2203.03776v1 [cs.LG])
    The question of how to produce a smooth interpolating curve from a stream of data points is addressed in this paper. To this end, we formalize the concept of real-time interpolator (RTI): a trainable unit that recovers smooth signals that are consistent with the received input samples in an online manner. Specifically, an RTI works under the requirement of producing a function section immediately after a sample is received (zero delay), without changing the reconstructed signal in past time sections. This work formulates the design of spline-based RTIs as a bi-level optimization problem. Their training consists in minimizing the average curvature of the interpolated signals over a set of example sequences. The latter are representative of the nature of the data sequence to be interpolated, allowing to tailor the RTI to a specific signal source. Our overall design allows for different possible schemes. In this work, we present two approaches, namely, the parametrized RTI and the recurrent neural network (RNN)-based RTI, including their architecture and properties. Experimental results show that the two proposed RTIs can be trained in a data-driven fashion to achieve improved performance (in terms of the curvature loss metric) with respect to a myopic-type RTI that only exploits the local information at each time sample, while maintaining smooth, zero-delay, and consistency requirements.  ( 2 min )
    High-order Order Proximity-Incorporated, Symmetry and Graph-Regularized Nonnegative Matrix Factorization for Community Detection. (arXiv:2203.03876v1 [cs.SI])
    Community describes the functional mechanism of a network, making community detection serve as a fundamental graph tool for various real applications like discovery of social circle. To date, a Symmetric and Non-negative Matrix Factorization (SNMF) model has been frequently adopted to address this issue owing to its high interpretability and scalability. However, most existing SNMF-based community detection methods neglect the high-order connection patterns in a network. Motivated by this discovery, in this paper, we propose a High-Order Proximity (HOP)-incorporated, Symmetry and Graph-regularized NMF (HSGN) model that adopts the following three-fold ideas: a) adopting a weighted pointwise mutual information (PMI)-based approach to measure the HOP indices among nodes in a network; b) leveraging an iterative reconstruction scheme to encode the captured HOP into the network; and c) introducing a symmetry and graph-regularized NMF algorithm to detect communities accurately. Extensive empirical studies on eight real-world networks demonstrate that an HSGN-based community detector significantly outperforms both benchmark and state-of-the-art community detectors in providing highly-accurate community detection results.  ( 2 min )
    Adapt$\mathcal{O}$r: Objective-Centric Adaptation Framework for Language Models. (arXiv:2203.03989v1 [cs.CL])
    Progress in natural language processing research is catalyzed by the possibilities given by the widespread software frameworks. This paper introduces Adaptor library that transposes the traditional model-centric approach composed of pre-training + fine-tuning steps to objective-centric approach, composing the training process by applications of selected objectives. We survey research directions that can benefit from enhanced objective-centric experimentation in multitask training, custom objectives development, dynamic training curricula, or domain adaptation. Adaptor aims to ease reproducibility of these research directions in practice. Finally, we demonstrate the practical applicability of Adaptor in selected unsupervised domain adaptation scenarios.  ( 2 min )
    Online Weak-form Sparse Identification of Partial Differential Equations. (arXiv:2203.03979v1 [math.OC])
    This paper presents an online algorithm for identification of partial differential equations (PDEs) based on the weak-form sparse identification of nonlinear dynamics algorithm (WSINDy). The algorithm is online in a sense that if performs the identification task by processing solution snapshots that arrive sequentially. The core of the method combines a weak-form discretization of candidate PDEs with an online proximal gradient descent approach to the sparse regression problem. In particular, we do not regularize the $\ell_0$-pseudo-norm, instead finding that directly applying its proximal operator (which corresponds to a hard thresholding) leads to efficient online system identification from noisy data. We demonstrate the success of the method on the Kuramoto-Sivashinsky equation, the nonlinear wave equation with time-varying wavespeed, and the linear wave equation, in one, two, and three spatial dimensions, respectively. In particular, our examples show that the method is capable of identifying and tracking systems with coefficients that vary abruptly in time, and offers a streaming alternative to problems in higher dimensions.  ( 2 min )
    A Predictive Model for Student Performance in Classrooms Using Student Interactions With an eTextbook. (arXiv:2203.03713v1 [cs.CY])
    With the rise of online eTextbooks and Massive Open Online Courses (MOOCs), a huge amount of data has been collected related to students' learning. With the careful analysis of this data, educators can gain useful insights into the performance of their students and their behavior in learning a particular topic. This paper proposes a new model for predicting student performance based on an analysis of how students interact with an interactive online eTextbook. By being able to predict students' performance early in the course, educators can easily identify students at risk and provide a suitable intervention. We considered two main issues the prediction of good/bad performance and the prediction of the final exam grade. To build the proposed model, we evaluated the most popular classification and regression algorithms on data from a data structures and algorithms course (CS2) offered in a large public research university. Random Forest Regression and Multiple Linear Regression have been applied in Regression. While Logistic Regression, decision tree, Random Forest Classifier, K Nearest Neighbors, and Support Vector Machine have been applied in classification.  ( 2 min )
    A Preliminary Study on Aging Examining Online Handwriting. (arXiv:2203.03933v1 [cs.HC])
    In order to develop infocommunications devices so that the capabilities of the human brain may interact with the capabilities of any artificially cognitive system a deeper knowledge of aging is necessary. Especially if society does not want to exclude elder people and wants to develop automatic systems able to help and improve the quality of life of this group of population, healthy individuals as well as those with cognitive decline or other pathologies. This paper tries to establish the variations in handwriting tasks with the goal to obtain a better knowledge about aging. We present the correlation results between several parameters extracted from online handwriting and the age of the writers. It is based on BIOSECURID database, which consists of 400 people that provided several biometric traits, including online handwriting. The main idea is to identify those parameters that are more stable and those more age dependent. One challenging topic for disease diagnose is the differentiation between healthy and pathological aging. For this purpose, it is necessary to be aware of handwriting parameters that are, in general, not affected by aging and those who experiment changes, increase or decrease their values, because of it. This paper contributes to this research line analyzing a selected set of online handwriting parameters provided by a healthy group of population aged from 18 to 70 years. Preliminary results show that these parameters are not affected by aging and therefore, changes in their values can only be attributed to motor or cognitive disorders.  ( 2 min )
    Universal Graph Transformer Self-Attention Networks. (arXiv:1909.11855v13 [cs.LG] UPDATED)
    We introduce a transformer-based GNN model, named UGformer, to learn graph representations. In particular, we present two UGformer variants, wherein the first variant (publicized in September 2019) is to leverage the transformer on a set of sampled neighbors for each input node, while the second (publicized in May 2021) is to leverage the transformer on all input nodes. Experimental results demonstrate that the first UGformer variant achieves state-of-the-art accuracies on benchmark datasets for graph classification in both inductive setting and unsupervised transductive setting; and the second UGformer variant obtains state-of-the-art accuracies for inductive text classification. The code is available at: \url{https://github.com/daiquocnguyen/Graph-Transformer}.  ( 2 min )
    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. (arXiv:2112.10741v3 [cs.CV] UPDATED)
    Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.  ( 2 min )
    Low-Loss Subspace Compression for Clean Gains against Multi-Agent Backdoor Attacks. (arXiv:2203.03692v1 [cs.LG])
    Recent exploration of the multi-agent backdoor attack demonstrated the backfiring effect, a natural defense against backdoor attacks where backdoored inputs are randomly classified. This yields a side-effect of low accuracy w.r.t. clean labels, which motivates this paper's work on the construction of multi-agent backdoor defenses that maximize accuracy w.r.t. clean labels and minimize that of poison labels. Founded upon agent dynamics and low-loss subspace construction, we contribute three defenses that yield improved multi-agent backdoor robustness.  ( 2 min )
    RFpredInterval: An R Package for Prediction Intervals with Random Forests and Boosted Forests. (arXiv:2106.08217v2 [stat.ML] UPDATED)
    Like many predictive models, random forests provide point predictions for new observations. Besides the point prediction, it is important to quantify the uncertainty in the prediction. Prediction intervals provide information about the reliability of the point predictions. We have developed a comprehensive R package, RFpredInterval, that integrates 16 methods to build prediction intervals with random forests and boosted forests. The set of methods implemented in the package includes a new method to build prediction intervals with boosted forests (PIBF) and 15 method variations to produce prediction intervals with random forests, as proposed by Roy and Larocque (2020). We perform an extensive simulation study and apply real data analyses to compare the performance of the proposed method to ten existing methods for building prediction intervals with random forests. The results show that the proposed method is very competitive and, globally, outperforms competing methods.  ( 2 min )
    Data adaptive RKHS Tikhonov regularization for learning kernels in operators. (arXiv:2203.03791v1 [stat.ML])
    We present DARTR: a Data Adaptive RKHS Tikhonov Regularization method for the linear inverse problem of nonparametric learning of function parameters in operators. A key ingredient is a system intrinsic data-adaptive (SIDA) RKHS, whose norm restricts the learning to take place in the function space of identifiability. DARTR utilizes this norm and selects the regularization parameter by the L-curve method. We illustrate its performance in examples including integral operators, nonlinear operators and nonlocal operators with discrete synthetic data. Numerical results show that DARTR leads to an accurate estimator robust to both numerical error due to discrete data and noise in data, and the estimator converges at a consistent rate as the data mesh refines under different levels of noises, outperforming two baseline regularizers using $l^2$ and $L^2$ norms.  ( 2 min )
    Does the Data Induce Capacity Control in Deep Learning?. (arXiv:2110.14163v2 [cs.LG] UPDATED)
    We show that the input correlation matrix of typical classification datasets has an eigenspectrum where, after a sharp initial drop, a large number of small eigenvalues are distributed uniformly over an exponentially large range. This structure is mirrored in a network trained on this data: we show that the Hessian and the Fisher Information Matrix (FIM) have eigenvalues that are spread uniformly over exponentially large ranges. We call such eigenspectra "sloppy" because sets of weights corresponding to small eigenvalues can be changed by large magnitudes without affecting the loss. Networks trained on atypical datasets with non-sloppy inputs do not share these traits and deep networks trained on such datasets generalize poorly. Inspired by this, we study the hypothesis that sloppiness of inputs aids generalization in deep networks. We show that if the Hessian is sloppy, we can compute non-vacuous PAC-Bayes generalization bounds analytically. By exploiting our empirical observation that training predominantly takes place in the non-sloppy subspace of the FIM, we develop data-distribution dependent PAC-Bayes priors that lead to accurate generalization bounds using numerical optimization.  ( 2 min )
    Learning Sensorimotor Primitives of Sequential Manipulation Tasks from Visual Demonstrations. (arXiv:2203.03797v1 [cs.RO])
    This work aims to learn how to perform complex robot manipulation tasks that are composed of several, consecutively executed low-level sub-tasks, given as input a few visual demonstrations of the tasks performed by a person. The sub-tasks consist of moving the robot's end-effector until it reaches a sub-goal region in the task space, performing an action, and triggering the next sub-task when a pre-condition is met. Most prior work in this domain has been concerned with learning only low-level tasks, such as hitting a ball or reaching an object and grasping it. This paper describes a new neural network-based framework for learning simultaneously low-level policies as well as high-level policies, such as deciding which object to pick next or where to place it relative to other objects in the scene. A key feature of the proposed approach is that the policies are learned directly from raw videos of task demonstrations, without any manual annotation or post-processing of the data. Empirical results on object manipulation tasks with a robotic arm show that the proposed network can efficiently learn from real visual demonstrations to perform the tasks, and outperforms popular imitation learning algorithms.  ( 2 min )
    Learning to Bound: A Generative Cram\'er-Rao Bound. (arXiv:2203.03695v1 [cs.LG])
    The Cram\'er-Rao bound (CRB), a well-known lower bound on the performance of any unbiased parameter estimator, has been used to study a wide variety of problems. However, to obtain the CRB, requires an analytical expression for the likelihood of the measurements given the parameters, or equivalently a precise and explicit statistical model for the data. In many applications, such a model is not available. Instead, this work introduces a novel approach to approximate the CRB using data-driven methods, which removes the requirement for an analytical statistical model. This approach is based on the recent success of deep generative models in modeling complex, high-dimensional distributions. Using a learned normalizing flow model, we model the distribution of the measurements and obtain an approximation of the CRB, which we call Generative Cram\'er-Rao Bound (GCRB). Numerical experiments on simple problems validate this approach, and experiments on two image processing tasks of image denoising and edge detection with a learned camera noise model demonstrate its power and benefits.  ( 2 min )
    HyperMixer: An MLP-based Green AI Alternative to Transformers. (arXiv:2203.03691v1 [cs.CL])
    Transformer-based architectures are the model of choice for natural language understanding, but they come at a significant cost, as they have quadratic complexity in the input length and can be difficult to tune. In the pursuit of Green AI, we investigate simple MLP-based architectures. We find that existing architectures such as MLPMixer, which achieves token mixing through a static MLP applied to each feature independently, are too detached from the inductive biases required for natural language understanding. In this paper, we propose a simple variant, HyperMixer, which forms the token mixing MLP dynamically using hypernetworks. Empirically, we demonstrate that our model performs better than alternative MLP-based models, and on par with Transformers. In contrast to Transformers, HyperMixer achieves these results at substantially lower costs in terms of processing time, training data, and hyperparameter tuning.  ( 2 min )
    Sparsification and Filtering for Spatial-temporal GNN in Multivariate Time-series. (arXiv:2203.03991v1 [cs.LG])
    We propose an end-to-end architecture for multivariate time-series prediction that integrates a spatial-temporal graph neural network with a matrix filtering module. This module generates filtered (inverse) correlation graphs from multivariate time series before inputting them into a GNN. In contrast with existing sparsification methods adopted in graph neural network, our model explicitly leverage time-series filtering to overcome the low signal-to-noise ratio typical of complex systems data. We present a set of experiments, where we predict future sales from a synthetic time-series sales dataset. The proposed spatial-temporal graph neural network displays superior performances with respect to baseline approaches, with no graphical information, and with fully connected, disconnected graphs and unfiltered graphs.  ( 2 min )
    Multi-Modal Mixup for Robust Fine-tuning. (arXiv:2203.03897v1 [cs.CV])
    Pre-trained large-scale models provide a transferable embedding, and they show comparable performance on the diverse downstream task. However, the transferability of multi-modal learning is restricted, and the analysis of learned embedding has not been explored well. This paper provides a perspective to understand the multi-modal embedding in terms of uniformity and alignment. We newly find that the representation learned by multi-modal learning models such as CLIP has a two separated representation space for each heterogeneous dataset with less alignment. Besides, there are unexplored large intermediate areas between two modalities with less uniformity. Less robust embedding might restrict the transferability of the representation for the downstream task. This paper provides a new end-to-end fine-tuning method for robust representation that encourages better uniformity and alignment score. First, we propose a multi-modal Mixup, $m^{2}$-Mix that mixes the representation of image and text to generate the hard negative samples. Second, we fine-tune the multi-modal model on a hard negative sample as well as normal negative and positive samples with contrastive learning. Our multi-modal Mixup provides a robust representation, and we validate our methods on classification, retrieval, and structure-awareness task.  ( 2 min )
    Encoding Event-Based Data With a Hybrid SNN Guided Variational Auto-encoder in Neuromorphic Hardware. (arXiv:2104.00165v2 [cs.NE] UPDATED)
    Neuromorphic hardware equipped with learning capabilities can adapt to new, real-time data. While models of Spiking Neural Networks (SNNs) can now be trained using gradient descent to reach an accuracy comparable to equivalent conventional neural networks, such learning often relies on external labels. However, real-world data is unlabeled which can make supervised methods inapplicable. To solve this problem, we propose a Hybrid Guided Variational Autoencoder (VAE) which encodes event based data sensed by a Dynamic Vision Sensor (DVS) into a latent space representation using an SNN. These representations can be used as an embedding to measure data similarity and predict labels in real-world data. We show that the Hybrid Guided-VAE achieves 87% classification accuracy on the DVSGesture dataset and it can encode the sparse, noisy inputs into an interpretable latent space representation, visualized through T-SNE plots. We also implement the encoder component of the model on neuromorphic hardware and discuss the potential for our algorithm to enable real-time learning from real-world event data.  ( 2 min )
    Convergence Rates for Gaussian Mixtures of Experts. (arXiv:1907.04377v2 [math.ST] UPDATED)
    We provide a theoretical treatment of over-specified Gaussian mixtures of experts with covariate-free gating networks. We establish the convergence rates of the maximum likelihood estimation (MLE) for these models. Our proof technique is based on a novel notion of \emph{algebraic independence} of the expert functions. Drawing on optimal transport theory, we establish a connection between the algebraic independence and a certain class of partial differential equations (PDEs). Exploiting this connection allows us to derive convergence rates and minimax lower bounds for parameter estimation.  ( 2 min )
    A Push-Relabel Based Additive Approximation for Optimal Transport. (arXiv:2203.03732v1 [cs.LG])
    Optimal Transport is a popular distance metric for measuring similarity between distributions. Exact algorithms for computing Optimal Transport can be slow, which has motivated the development of approximate numerical solvers (e.g. Sinkhorn method). We introduce a new and very simple combinatorial approach to find an $\varepsilon$-approximation of the OT distance. Our algorithm achieves a near-optimal execution time of $O(n^2/\varepsilon^2)$ for computing OT distance and, for the special case of the assignment problem, the execution time improves to $O(n^2/\varepsilon)$. Our algorithm is based on the push-relabel framework for min-cost flow problems. Unlike the other combinatorial approach (Lahn, Mulchandani and Raghvendra, NeurIPS 2019) which does not have a fast parallel implementation, our algorithm has a parallel execution time of $O(\log n/\varepsilon^2)$. Interestingly, unlike the Sinkhorn algorithm, our method also readily provides a compact transport plan as well as a solution to an approximate version of the dual formulation of the OT problem, both of which have numerous applications in Machine Learning. For the assignment problem, we provide both a CPU implementation as well as an implementation that exploits GPU parallelism. Experiments suggest that our algorithm is faster than the Sinkhorn algorithm, both in terms of CPU and GPU implementations, especially while computing matchings with a high accuracy.  ( 2 min )
    Obstacle Aware Sampling for Path Planning. (arXiv:2203.04075v1 [cs.RO])
    Many path planning algorithms are based on sampling the state space. While this approach is very simple, it can become costly when the obstacles are unknown, since samples hitting these obstacles are wasted. The goal of this paper is to efficiently identify obstacles in a map and remove them from the sampling space. To this end, we propose a pre-processing algorithm for space exploration that enables more efficient sampling. We show that it can boost the performance of other space sampling methods and path planners. Our approach is based on the fact that a convex obstacle can be approximated provably well by its minimum volume enclosing ellipsoid (MVEE), and a non-convex obstacle may be partitioned into convex shapes. Our main contribution is an algorithm that strategically finds a small sample, called the \emph{active-coreset}, that adaptively samples the space via membership-oracle such that the MVEE of the coreset approximates the MVEE of the obstacle. Experimental results confirm the effectiveness of our approach across multiple planners based on Rapidly-exploring random trees, showing significant improvement in terms of time and path length.  ( 2 min )
    VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer. (arXiv:2203.04099v1 [cs.SD])
    This paper presents an audio-visual approach for voice separation which outperforms state-of-the-art methods at a low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and comparison to state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation in the task of singing voice separation. The demos, code, and weights will be made publicly available at https://ipcv.github.io/VoViT/  ( 2 min )
    EntQA: Entity Linking as Question Answering. (arXiv:2110.02369v2 [cs.CL] UPDATED)
    A conventional approach to entity linking is to first find mentions in a given document and then infer their underlying entities in the knowledge base. A well-known limitation of this approach is that it requires finding mentions without knowing their entities, which is unnatural and difficult. We present a new model that does not suffer from this limitation called EntQA, which stands for Entity linking as Question Answering. EntQA first proposes candidate entities with a fast retrieval module, and then scrutinizes the document to find mentions of each candidate with a powerful reader module. Our approach combines progress in entity linking with that in open-domain question answering and capitalizes on pretrained models for dense entity retrieval and reading comprehension. Unlike in previous works, we do not rely on a mention-candidates dictionary or large-scale weak supervision. EntQA achieves strong results on the GERBIL benchmarking platform.  ( 2 min )
    A Sharp Characterization of Linear Estimators for Offline Policy Evaluation. (arXiv:2203.04236v1 [cs.LG])
    Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a potentially different policy. In order to tackle problems with complex, high-dimensional observations, there has been significant interest from theoreticians and practitioners alike in understanding the possibility of function approximation in reinforcement learning. Despite significant study, a sharp characterization of when we might expect offline policy evaluation to be tractable, even in the simplest setting of linear function approximation, has so far remained elusive, with a surprising number of strong negative results recently appearing in the literature. In this work, we identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods, in particular Fitted Q-iteration (FQI) and least squares temporal difference learning (LSTD), to succeed at offline policy evaluation. Using this characterization, we establish a precise hierarchy of regimes under which these estimators succeed. We prove that LSTD works under strictly weaker conditions than FQI. Furthermore, we establish that if a problem is not solvable via LSTD, then it cannot be solved by a broad class of linear estimators, even in the limit of infinite data. Taken together, our results provide a complete picture of the behavior of linear estimators for offline policy evaluation (OPE), unify previously disparate analyses of canonical algorithms, and provide significantly sharper notions of the underlying statistical complexity of OPE.  ( 2 min )
    Any Variational Autoencoder Can Do Arbitrary Conditioning. (arXiv:2201.12414v2 [cs.LG] UPDATED)
    Arbitrary conditioning is an important problem in unsupervised learning, where we seek to model the conditional densities $p(\mathbf{x}_u \mid \mathbf{x}_o)$ that underly some data, for all possible non-intersecting subsets $o, u \subset \{1, \dots , d\}$. However, the vast majority of density estimation only focuses on modeling the joint distribution $p(\mathbf{x})$, in which important conditional dependencies between features are opaque. We propose a simple and general framework, coined Posterior Matching, that enables any Variational Autoencoder (VAE) to perform arbitrary conditioning, without modification to the VAE itself. Posterior Matching applies to the numerous existing VAE-based approaches to joint density estimation, thereby circumventing the specialized models required by previous approaches to arbitrary conditioning. We find that Posterior Matching achieves performance that is comparable or superior to current state-of-the-art methods for a variety of tasks.  ( 2 min )
    QuatRE: Relation-Aware Quaternions for Knowledge Graph Embeddings. (arXiv:2009.12517v2 [cs.CL] UPDATED)
    We propose a simple yet effective embedding model to learn quaternion embeddings for entities and relations in knowledge graphs. Our model aims to enhance correlations between head and tail entities given a relation within the Quaternion space with Hamilton product. The model achieves this goal by further associating each relation with two relation-aware rotations, which are used to rotate quaternion embeddings of the head and tail entities, respectively. Experimental results show that our proposed model produces state-of-the-art performances on well-known benchmark datasets for knowledge graph completion. Our code is available at: \url{https://github.com/daiquocnguyen/QuatRE}.  ( 2 min )
    Multi-Label Annotation of Chest Abdomen Pelvis Computed Tomography Text Reports Using Deep Learning. (arXiv:2102.02959v5 [cs.AI] UPDATED)
    Purpose: To develop high throughput multi-label annotators for body (chest, abdomen, and pelvis) Computed Tomography (CT) reports that can be applied across a variety of abnormalities, organs, and disease states. Approach: We used a dictionary approach to develop rule-based algorithms (RBA) for extraction of disease labels from radiology text reports. We targeted three organ systems (lungs/pleura, liver/gallbladder, kidneys/ureters) with four diseases per system based on their prevalence in our dataset. To expand the algorithms beyond pre-defined keywords, attention-guided recurrent neural networks (RNN) were trained using the RBA-extracted labels to classify reports as being positive for one or more diseases or normal for each organ system. Confounding effects on model performance were evaluated using random initialization or pre-trained embedding as well as different sizes of training datasets. Performance was evaluated using the receiver operating characteristic (ROC) area under the curve (AUC) against 2,158 manually obtained labels. Results: Our models extracted disease labels from 261,229 radiology reports of 112,501 unique subjects. Pre-trained models outperformed random initialization across all diseases. As the training dataset size was reduced, performance was robust except for a few diseases with relatively small number of cases. Pre-trained classification AUCs achieved > 0.95 for all five disease outcomes across all three organ systems. Conclusions: Our label-extracting pipeline was able to encompass a variety of cases and diseases by generalizing beyond strict rules with exceptional accuracy. This method can be easily adapted to enable automated labeling of hospital-scale medical data sets for training image-based disease classifiers.  ( 3 min )
    Robot Learning of Mobile Manipulation with Reachability Behavior Priors. (arXiv:2203.04051v1 [cs.RO])
    Mobile Manipulation (MM) systems are ideal candidates for taking up the role of a personal assistant in unstructured real-world environments. Among other challenges, MM requires effective coordination of the robot's embodiments for executing tasks that require both mobility and manipulation. Reinforcement Learning (RL) holds the promise of endowing robots with adaptive behaviors, but most methods require prohibitively large amounts of data for learning a useful control policy. In this work, we study the integration of robotic reachability priors in actor-critic RL methods for accelerating the learning of MM for reaching and fetching tasks. Namely, we consider the problem of optimal base placement and the subsequent decision of whether to activate the arm for reaching a 6D target. For this, we devise a novel Hybrid RL method that handles discrete and continuous actions jointly, resorting to the Gumbel-Softmax reparameterization. Next, we train a reachability prior using data from the operational robot workspace, inspired by classical methods. Subsequently, we derive Boosted Hybrid RL (BHyRL), a novel algorithm for learning Q-functions by modeling them as a sum of residual approximators. Every time a new task needs to be learned, we can transfer our learned residuals and learn the component of the Q-function that is task-specific, hence, maintaining the task structure from prior behaviors. Moreover, we find that regularizing the target policy with a prior policy yields more expressive behaviors. We evaluate our method in simulation in reaching and fetching tasks of increasing difficulty, and we show the superior performance of BHyRL against baseline methods. Finally, we zero-transfer our learned 6D fetching policy with BHyRL to our MM robot TIAGo++. For more details and code release, please refer to our project site: irosalab.com/rlmmbp  ( 2 min )
    Defending Graph Convolutional Networks against Dynamic Graph Perturbations via Bayesian Self-supervision. (arXiv:2203.03762v1 [cs.LG])
    In recent years, plentiful evidence illustrates that Graph Convolutional Networks (GCNs) achieve extraordinary accomplishments on the node classification task. However, GCNs may be vulnerable to adversarial attacks on label-scarce dynamic graphs. Many existing works aim to strengthen the robustness of GCNs; for instance, adversarial training is used to shield GCNs against malicious perturbations. However, these works fail on dynamic graphs for which label scarcity is a pressing issue. To overcome label scarcity, self-training attempts to iteratively assign pseudo-labels to highly confident unlabeled nodes but such attempts may suffer serious degradation under dynamic graph perturbations. In this paper, we generalize noisy supervision as a kind of self-supervised learning method and then propose a novel Bayesian self-supervision model, namely GraphSS, to address the issue. Extensive experiments demonstrate that GraphSS can not only affirmatively alert the perturbations on dynamic graphs but also effectively recover the prediction of a node classifier when the graph is under such perturbations. These two advantages prove to be generalized over three classic GCNs across five public graph datasets.  ( 2 min )
    Biological Sequence Design with GFlowNets. (arXiv:2203.04115v1 [q-bio.BM])
    Design of de novo biological sequences with desired properties, like protein and DNA sequences, often involves an active loop with several rounds of molecule ideation and expensive wet-lab evaluations. These experiments can consist of multiple stages, with increasing levels of precision and cost of evaluation, where candidates are filtered. This makes the diversity of proposed candidates a key consideration in the ideation phase. In this work, we propose an active learning algorithm leveraging epistemic uncertainty estimation and the recently proposed GFlowNets as a generator of diverse candidate solutions, with the objective to obtain a diverse batch of useful (as defined by some utility function, for example, the predicted anti-microbial activity of a peptide) and informative candidates after each round. We also propose a scheme to incorporate existing labeled datasets of candidates, in addition to a reward function, to speed up learning in GFlowNets. We present empirical results on several biological sequence design tasks, and we find that our method generates more diverse and novel batches with high scoring candidates compared to existing approaches.  ( 2 min )
    Robustness and Usefulness in AI Explanation Methods. (arXiv:2203.03729v1 [cs.LG])
    Explainability in machine learning has become incredibly important as machine learning-powered systems become ubiquitous and both regulation and public sentiment begin to demand an understanding of how these systems make decisions. As a result, a number of explanation methods have begun to receive widespread adoption. This work summarizes, compares, and contrasts three popular explanation methods: LIME, SmoothGrad, and SHAP. We evaluate these methods with respect to: robustness, in the sense of sample complexity and stability; understandability, in the sense that provided explanations are consistent with user expectations; and usability, in the sense that the explanations allow for the model to be modified based on the output. This work concludes that current explanation methods are insufficient; that putting faith in and adopting these methods may actually be worse than simply not using them.  ( 2 min )
    Ensemble deep learning: A review. (arXiv:2104.02395v2 [cs.LG] UPDATED)
    Ensemble learning combines several individual models to obtain better generalization performance. Currently, deep learning models with multilayer processing architecture is showing better performance as compared to the shallow or traditional classification models. Deep ensemble learning models combine the advantages of both the deep learning models as well as the ensemble learning such that the final model has better generalization performance. This paper reviews the state-of-art deep ensemble models and hence serves as an extensive summary for the researchers. The ensemble models are broadly categorised into ensemble models like bagging, boosting and stacking, negative correlation based deep ensemble models, explicit/implicit ensembles, homogeneous /heterogeneous ensemble, decision fusion strategies, unsupervised, semi-supervised, reinforcement learning and online/incremental, multilabel based deep ensemble models. Application of deep ensemble models in different domains is also briefly discussed. Finally, we conclude this paper with some future recommendations and research directions.  ( 2 min )
    Strength of Minibatch Noise in SGD. (arXiv:2102.05375v3 [cs.LG] UPDATED)
    The noise in stochastic gradient descent (SGD), caused by minibatch sampling, is poorly understood despite its practical importance in deep learning. This work presents the first systematic study of the SGD noise and fluctuations close to a local minimum. We first analyze the SGD noise in linear regression in detail and then derive a general formula for approximating SGD noise in different types of minima. For application, our results (1) provide insight into the stability of training a neural network, (2) suggest that a large learning rate can help generalization by introducing an implicit regularization, (3) explain why the linear learning rate-batchsize scaling law fails at a large learning rate or at a small batchsize and (4) can provide an understanding of how discrete-time nature of SGD affects the recently discovered power-law phenomenon of SGD.  ( 2 min )
    Follow the Water: Finding Water, Snow and Clouds on Terrestrial Exoplanets with Photometry and Machine Learning. (arXiv:2203.04201v1 [astro-ph.EP])
    All life on Earth needs water. NASA's quest to follow the water links water to the search for life in the cosmos. Telescopes like JWST and mission concepts like HabEx, LUVOIR and Origins are designed to characterise rocky exoplanets spectroscopically. However, spectroscopy remains time-intensive and therefore, initial characterisation is critical to prioritisation of targets. Here, we study machine learning as a tool to assess water's existence through broadband-filter reflected photometric flux on Earth-like exoplanets in three forms: seawater, water-clouds and snow; based on 53,130 spectra of cold, Earth-like planets with 6 major surfaces. XGBoost, a well-known machine learning algorithm, achieves over 90\% balanced accuracy in detecting the existence of snow or clouds for S/N$\gtrsim 20$, and 70\% for liquid seawater for S/N $\gtrsim 30$. Finally, we perform mock Bayesian analysis with Markov-chain Monte Carlo with five filters identified to derive exact surface compositions to test for retrieval feasibility. The results show that the use of machine learning to identify water on the surface of exoplanets from broadband-filter photometry provides a promising initial characterisation tool of water in different forms. Planned small and large telescope missions could use this to aid their prioritisation of targets for time-intense follow-up observations.  ( 2 min )
    A Gating Model for Bias Calibration in Generalized Zero-shot Learning. (arXiv:2203.04195v1 [cs.LG])
    Generalized zero-shot learning (GZSL) aims at training a model that can generalize to unseen class data by only using auxiliary information. One of the main challenges in GZSL is a biased model prediction toward seen classes caused by overfitting on only available seen class data during training. To overcome this issue, we propose a two-stream autoencoder-based gating model for GZSL. Our gating model predicts whether the query data is from seen classes or unseen classes, and utilizes separate seen and unseen experts to predict the class independently from each other. This framework avoids comparing the biased prediction scores for seen classes with the prediction scores for unseen classes. In particular, we measure the distance between visual and attribute representations in the latent space and the cross-reconstruction space of the autoencoder. These distances are utilized as complementary features to characterize unseen classes at different levels of data abstraction. Also, the two-stream autoencoder works as a unified framework for the gating model and the unseen expert, which makes the proposed method computationally efficient. We validate our proposed method in four benchmark image recognition datasets. In comparison with other state-of-the-art methods, we achieve the best harmonic mean accuracy in SUN and AWA2, and the second best in CUB and AWA1. Furthermore, our base model requires at least 20% less number of model parameters than state-of-the-art methods relying on generative models.  ( 2 min )
    Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. (arXiv:2201.07207v2 [cs.LG] UPDATED)
    Can world knowledge learned by large language models (LLMs) be used to act in interactive environments? In this paper, we investigate the possibility of grounding high-level tasks, expressed in natural language (e.g. "make breakfast"), to a chosen set of actionable steps (e.g. "open fridge"). While prior work focused on learning from explicit step-by-step examples of how to act, we surprisingly find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into mid-level plans without any further training. However, the plans produced naively by LLMs often cannot map precisely to admissible actions. We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions. Our evaluation in the recent VirtualHome environment shows that the resulting method substantially improves executability over the LLM baseline. The conducted human evaluation reveals a trade-off between executability and correctness but shows a promising sign towards extracting actionable knowledge from language models. Website at https://huangwl18.github.io/language-planner  ( 2 min )
    FAR: A General Framework for Attributional Robustness. (arXiv:2010.07393v2 [cs.LG] UPDATED)
    Attribution maps are popular tools for explaining neural networks predictions. By assigning an importance value to each input dimension that represents its impact towards the outcome, they give an intuitive explanation of the decision process. However, recent work has discovered vulnerability of these maps to imperceptible adversarial changes, which can prove critical in safety-relevant domains such as healthcare. Therefore, we define a novel generic framework for attributional robustness (FAR) as general problem formulation for training models with robust attributions. This framework consist of a generic regularization term and training objective that minimize the maximal dissimilarity of attribution maps in a local neighbourhood of the input. We show that FAR is a generalized, less constrained formulation of currently existing training methods. We then propose two new instantiations of this framework, AAT and AdvAAT, that directly optimize for both robust attributions and predictions. Experiments performed on widely used vision datasets show that our methods perform better or comparably to current ones in terms of attributional robustness while being more generally applicable. We finally show that our methods mitigate undesired dependencies between attributional robustness and some training and estimation parameters, which seem to critically affect other competitor methods.  ( 2 min )
    A Compilation Flow for the Generation of CNN Inference Accelerators on FPGAs. (arXiv:2203.04015v1 [cs.DC])
    We present a compilation flow for the generation of CNN inference accelerators on FPGAs. The flow translates a frozen model into OpenCL kernels with the TVM compiler and uses the Intel OpenCL SDK to compile to an FPGA bitstream. We improve the quality of the generated hardware with optimizations applied to the base OpenCL kernels generated by TVM. These optimizations increase parallelism, reduce memory access latency, increase concurrency and save on-chip resources. We automate these optimizations in TVM and evaluate them by generating accelerators for LeNet-5, MobileNetV1 and ResNet-34 on an Intel Stratix~10SX. We show that the optimizations improve the performance of the generated accelerators by up to 846X over the base accelerators. The performance of the optimized accelerators is up to 4.57X better than TensorFlow on CPU, 3.83X better than single-threaded TVM and is only 0.34X compared to TVM with 56 threads. Our optimized kernels also outperform ones generated by a similar approach (that also uses high-level synthesis) while providing more functionality and flexibility. However, it underperforms an approach that utilizes hand-optimized designs. Thus, we view our approach as useful in pre-production environments that benefit from increased performance and fast prototyping, realizing the benefits of FPGAs without hardware design expertise.  ( 2 min )
    PEARL: Data Synthesis via Private Embeddings and Adversarial Reconstruction Learning. (arXiv:2106.04590v2 [cs.LG] UPDATED)
    We propose a new framework of synthesizing data using deep generative models in a differentially private manner. Within our framework, sensitive data are sanitized with rigorous privacy guarantees in a one-shot fashion, such that training deep generative models is possible without re-using the original data. Hence, no extra privacy costs or model constraints are incurred, in contrast to popular approaches such as Differentially Private Stochastic Gradient Descent (DP-SGD), which, among other issues, causes degradation in privacy guarantees as the training iteration increases. We demonstrate a realization of our framework by making use of the characteristic function and an adversarial re-weighting objective, which are of independent interest as well. Our proposal has theoretical guarantees of performance, and empirical evaluations on multiple datasets show that our approach outperforms other methods at reasonable levels of privacy.  ( 2 min )
    On Generalizing Beyond Domains in Cross-Domain Continual Learning. (arXiv:2203.03970v1 [cs.LG])
    Humans have the ability to accumulate knowledge of new tasks in varying conditions, but deep neural networks often suffer from catastrophic forgetting of previously learned knowledge after learning a new task. Many recent methods focus on preventing catastrophic forgetting under the assumption of train and test data following similar distributions. In this work, we consider a more realistic scenario of continual learning under domain shifts where the model must generalize its inference to an unseen domain. To this end, we encourage learning semantically meaningful features by equipping the classifier with class similarity metrics as learning parameters which are obtained through Mahalanobis similarity computations. Learning of the backbone representation along with these extra parameters is done seamlessly in an end-to-end manner. In addition, we propose an approach based on the exponential moving average of the parameters for better knowledge distillation. We demonstrate that, to a great extent, existing continual learning algorithms fail to handle the forgetting issue under multiple distributions, while our proposed approach learns new tasks under domain shift with accuracy boosts up to 10% on challenging datasets such as DomainNet and OfficeHome.  ( 2 min )
    Evidential Turing Processes. (arXiv:2106.01216v3 [cs.LG] UPDATED)
    A probabilistic classifier with reliable predictive uncertainties i) fits successfully to the target domain data, ii) provides calibrated class probabilities in difficult regions of the target domain (e.g.\ class overlap), and iii) accurately identifies queries coming out of the target domain and rejects them. We introduce an original combination of Evidential Deep Learning, Neural Processes, and Neural Turing Machines capable of providing all three essential properties mentioned above for total uncertainty quantification. We observe our method on five classification tasks to be the only one that can excel all three aspects of total calibration with a single standalone predictor. Our unified solution delivers an implementation-friendly and compute efficient recipe for safety clearance and provides intellectual economy to an investigation of algorithmic roots of epistemic awareness in deep neural nets.  ( 2 min )
    Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding. (arXiv:2203.03838v1 [cs.CV])
    Query-based video grounding is an important yet challenging task in video understanding, which aims to localize the target segment in an untrimmed video according to a sentence query. Most previous works achieve significant progress by addressing this task in a fully-supervised manner with segment-level labels, which require high labeling cost. Although some recent efforts develop weakly-supervised methods that only need the video-level knowledge, they generally match multiple pre-defined segment proposals with query and select the best one, which lacks fine-grained frame-level details for distinguishing frames with high repeatability and similarity within the entire video. To alleviate the above limitations, we propose a self-contrastive learning framework to address the query-based video grounding task under a weakly-supervised setting. Firstly, instead of utilizing redundant segment proposals, we propose a new grounding scheme that learns frame-wise matching scores referring to the query semantic to predict the possible foreground frames by only using the video-level annotations. Secondly, since some predicted frames (i.e., boundary frames) are relatively coarse and exhibit similar appearance to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm to learn more discriminative frame-wise representations for distinguishing the false positive frames. In particular, we iteratively explore multi-scale hard negative samples that are close to positive samples in the representation space for distinguishing fine-grained frame-wise details, thus enforcing more accurate segment grounding. Extensive experiments on two challenging benchmarks demonstrate the superiority of our proposed method compared with the state-of-the-art methods.  ( 2 min )
    Locate This, Not That: Class-Conditioned Sound Event DOA Estimation. (arXiv:2203.04197v1 [eess.AS])
    Existing systems for sound event localization and detection (SELD) typically operate by estimating a source location for all classes at every time instant. In this paper, we propose an alternative class-conditioned SELD model for situations where we may not be interested in localizing all classes all of the time. This class-conditioned SELD model takes as input the spatial and spectral features from the sound file, and also a one-hot vector indicating the class we are currently interested in localizing. We inject the conditioning information at several points in our model using feature-wise linear modulation (FiLM) layers. Through experiments on the DCASE 2020 Task 3 dataset, we show that the proposed class-conditioned SELD model performs better in terms of common SELD metrics than the baseline model that locates all classes simultaneously, and also outperforms specialist models that are trained to locate only a single class of interest. We also evaluate performance on the DCASE 2021 Task 3 dataset, which includes directional interference (sound events from classes we are not interested in localizing) and notice especially strong improvement from the class-conditioned model.  ( 2 min )
    Unsupervised Image Registration Towards Enhancing Performance and Explainability in Cardiac And Brain Image Analysis. (arXiv:2203.03638v1 [eess.IV])
    Magnetic Resonance Imaging (MRI) typically recruits multiple sequences (defined here as "modalities"). As each modality is designed to offer different anatomical and functional clinical information, there are evident disparities in the imaging content across modalities. Inter- and intra-modality affine and non-rigid image registration is an essential medical image analysis process in clinical imaging, as for example before imaging biomarkers need to be derived and clinically evaluated across different MRI modalities, time phases and slices. Although commonly needed in real clinical scenarios, affine and non-rigid image registration is not extensively investigated using a single unsupervised model architecture. In our work, we present an un-supervised deep learning registration methodology which can accurately model affine and non-rigid trans-formations, simultaneously. Moreover, inverse-consistency is a fundamental inter-modality registration property that is not considered in deep learning registration algorithms. To address inverse-consistency, our methodology performs bi-directional cross-modality image synthesis to learn modality-invariant latent rep-resentations, while involves two factorised transformation networks and an inverse-consistency loss to learn topology-preserving anatomical transformations. Overall, our model (named "FIRE") shows improved performances against the reference standard baseline method on multi-modality brain 2D and 3D MRI and intra-modality cardiac 4D MRI data experiments.  ( 2 min )
    Trustable Co-label Learning from Multiple Noisy Annotators. (arXiv:2203.04199v1 [cs.LG])
    Supervised deep learning depends on massive accurately annotated examples, which is usually impractical in many real-world scenarios. A typical alternative is learning from multiple noisy annotators. Numerous earlier works assume that all labels are noisy, while it is usually the case that a few trusted samples with clean labels are available. This raises the following important question: how can we effectively use a small amount of trusted data to facilitate robust classifier learning from multiple annotators? This paper proposes a data-efficient approach, called \emph{Trustable Co-label Learning} (TCL), to learn deep classifiers from multiple noisy annotators when a small set of trusted data is available. This approach follows the coupled-view learning manner, which jointly learns the data classifier and the label aggregator. It effectively uses trusted data as a guide to generate trustable soft labels (termed co-labels). A co-label learning can then be performed by alternately reannotating the pseudo labels and refining the classifiers. In addition, we further improve TCL for a special complete data case, where each instance is labeled by all annotators and the label aggregator is represented by multilayer neural networks to enhance model capacity. Extensive experiments on synthetic and real datasets clearly demonstrate the effectiveness and robustness of the proposed approach. Source code is available at https://github.com/ShikunLi/TCL  ( 2 min )
    Information Theoretic Structured Generative Modeling. (arXiv:2110.05794v2 [cs.LG] UPDATED)
    R\'enyi's information provides a theoretical foundation for tractable and data-efficient non-parametric density estimation, based on pair-wise evaluations in a reproducing kernel Hilbert space (RKHS). This paper extends this framework to parametric probabilistic modeling, motivated by the fact that R\'enyi's information can be estimated in closed-form for Gaussian mixtures. Based on this special connection, a novel generative model framework called the structured generative model (SGM) is proposed that makes straightforward optimization possible, because costs are scale-invariant, avoiding high gradient variance while imposing less restrictions on absolute continuity, which is a huge advantage in parametric information theoretic optimization. The implementation employs a single neural network driven by an orthonormal input appended to a single white noise source adapted to learn an infinite Gaussian mixture model (IMoG), which provides an empirically tractable model distribution in low dimensions. To train SGM, we provide three novel variational cost functions, based on R\'enyi's second-order entropy and divergence, to implement minimization of cross-entropy, minimization of variational representations of $f$-divergence, and maximization of the evidence lower bound (conditional probability). We test the framework for estimation of mutual information and compare the results with the mutual information neural estimation (MINE), for density estimation, for conditional probability estimation in Markov models as well as for training adversarial networks. Our preliminary results show that SGM significantly improves MINE estimation in terms of data efficiency and variance, conventional and variational Gaussian mixture models, as well as the performance of generative adversarial networks.  ( 2 min )
    Learning based Age of Information Minimization in UAV-relayed IoT Networks. (arXiv:2203.04227v1 [cs.IT])
    Unmanned Aerial Vehicles (UAVs) are used as aerial base-stations to relay time-sensitive packets from IoT devices to the nearby terrestrial base-station (TBS). Scheduling of packets in such UAV-relayed IoT-networks to ensure fresh (or up-to-date) IoT devices' packets at the TBS is a challenging problem as it involves two simultaneous steps of (i) sampling of packets generated at IoT devices by the UAVs [hop-1] and (ii) updating of sampled packets from UAVs to the TBS [hop-2]. To address this, we propose Age-of-Information (AoI) scheduling algorithms for two-hop UAV-relayed IoT-networks. First, we propose a low-complexity AoI scheduler, termed, MAF-MAD that employs Maximum AoI First (MAF) policy for sampling of IoT devices at UAV (hop-1) and Maximum AoI Difference (MAD) policy for updating sampled packets from UAV to the TBS (hop-2). We prove that MAF-MAD is the optimal AoI scheduler under ideal conditions (lossless wireless channels and generate-at-will traffic-generation at IoT devices). On the contrary, for general conditions (lossy channel conditions and varying periodic traffic-generation at IoT devices), a deep reinforcement learning algorithm, namely, Proximal Policy Optimization (PPO)-based scheduler is proposed. Simulation results show that the proposed PPO-based scheduler outperforms other schedulers like MAF-MAD, MAF, and round-robin in all considered general scenarios.  ( 2 min )
    WaveMix: Resource-efficient Token Mixing for Images. (arXiv:2203.03689v1 [cs.CV])
    Although certain vision transformer (ViT) and CNN architectures generalize well on vision tasks, it is often impractical to use them on green, edge, or desktop computing due to their computational requirements for training and even testing. We present WaveMix as an alternative neural architecture that uses a multi-scale 2D discrete wavelet transform (DWT) for spatial token mixing. Unlike ViTs, WaveMix neither unrolls the image nor requires self-attention of quadratic complexity. Additionally, DWT introduces another inductive bias -- besides convolutional filtering -- to utilize the 2D structure of an image to improve generalization. The multi-scale nature of the DWT also reduces the requirement for a deeper architecture compared to the CNNs, as the latter relies on pooling for partial spatial mixing. WaveMix models show generalization that is competitive with ViTs, CNNs, and token mixers on several datasets while requiring lower GPU RAM (training and testing), number of computations, and storage. WaveMix have achieved State-of-the-art (SOTA) results in EMNIST Byclass and EMNIST Balanced datasets.  ( 2 min )
    Nonlinear Isometric Manifold Learning for Injective Normalizing Flows. (arXiv:2203.03934v1 [cs.LG])
    To model manifold data using normalizing flows, we propose to employ the isometric autoencoder to design nonlinear encodings with explicit inverses. The isometry allows us to separate manifold learning and density estimation and train both parts to high accuracy. Applied to the MNIST data set, the combined approach generates high-quality images.  ( 2 min )
    Motron: Multimodal Probabilistic Human Motion Forecasting. (arXiv:2203.04132v1 [cs.CV])
    Autonomous systems and humans are increasingly sharing the same space. Robots work side by side or even hand in hand with humans to balance each other's limitations. Such cooperative interactions are ever more sophisticated. Thus, the ability to reason not just about a human's center of gravity position, but also its granular motion is an important prerequisite for human-robot interaction. Though, many algorithms ignore the multimodal nature of humans or neglect uncertainty in their motion forecasts. We present Motron, a multimodal, probabilistic, graph-structured model, that captures human's multimodality using probabilistic methods while being able to output deterministic motions and corresponding confidence values for each mode. Our model aims to be tightly integrated with the robotic planning-control-interaction loop; outputting physically feasible human motions and being computationally efficient. We demonstrate the performance of our model on several challenging real-world motion forecasting datasets, outperforming a wide array of generative methods while providing state-of-the-art deterministic motions if required. Both using significantly less computational power than state-of-the art algorithms.  ( 2 min )
    Responsible AI in Healthcare. (arXiv:2203.03616v1 [cs.CY])
    This article discusses open problems, implemented solutions, and future research in the area of responsible AI in healthcare. In particular, we illustrate two main research themes related to the work of two laboratories within the Department of Informatics, Systems, and Communication at the University of Milano-Bicocca. The problems addressed concern, in particular, {uncertainty in medical data and machine advice}, and the problem of online health information disorder.  ( 2 min )
    Detection of AI Synthesized Hindi Speech. (arXiv:2203.03706v1 [cs.SD])
    The recent advancements in generative artificial speech models have made possible the generation of highly realistic speech signals. At first, it seems exciting to obtain these artificially synthesized signals such as speech clones or deep fakes but if left unchecked, it may lead us to digital dystopia. One of the primary focus in audio forensics is validating the authenticity of a speech. Though some solutions are proposed for English speeches but the detection of synthetic Hindi speeches have not gained much attention. Here, we propose an approach for discrimination of AI synthesized Hindi speech from an actual human speech. We have exploited the Bicoherence Phase, Bicoherence Magnitude, Mel Frequency Cepstral Coefficient (MFCC), Delta Cepstral, and Delta Square Cepstral as the discriminating features for machine learning models. Also, we extend the study to using deep neural networks for extensive experiments, specifically VGG16 and homemade CNN as the architecture models. We obtained an accuracy of 99.83% with VGG16 and 99.99% with homemade CNN models.  ( 2 min )
    Dual Lottery Ticket Hypothesis. (arXiv:2203.04248v1 [cs.LG])
    Fully exploiting the learning capacity of neural networks requires overparameterized dense networks. On the other side, directly training sparse neural networks typically results in unsatisfactory performance. Lottery Ticket Hypothesis (LTH) provides a novel view to investigate sparse network training and maintain its capacity. Concretely, it claims there exist winning tickets from a randomly initialized network found by iterative magnitude pruning and preserving promising trainability (or we say being in trainable condition). In this work, we regard the winning ticket from LTH as the subnetwork which is in trainable condition and its performance as our benchmark, then go from a complementary direction to articulate the Dual Lottery Ticket Hypothesis (DLTH): Randomly selected subnetworks from a randomly initialized dense network can be transformed into a trainable condition and achieve admirable performance compared with LTH -- random tickets in a given lottery pool can be transformed into winning tickets. Specifically, by using uniform-randomly selected subnetworks to represent the general cases, we propose a simple sparse network training strategy, Random Sparse Network Transformation (RST), to substantiate our DLTH. Concretely, we introduce a regularization term to borrow learning capacity and realize information extrusion from the weights which will be masked. After finishing the transformation for the randomly selected subnetworks, we conduct the regular finetuning to evaluate the model using fair comparisons with LTH and other strong baselines. Extensive experiments on several public datasets and comparisons with competitive approaches validate our DLTH as well as the effectiveness of the proposed model RST. Our work is expected to pave a way for inspiring new research directions of sparse network training in the future. Our code is available at https://github.com/yueb17/DLTH.  ( 2 min )
    Enhancing Mechanical Metamodels with a Generative Model-Based Augmented Training Dataset. (arXiv:2203.04183v1 [cs.LG])
    Modeling biological soft tissue is complex in part due to material heterogeneity. Microstructural patterns, which play a major role in defining the mechanical behavior of these tissues, are both challenging to characterize, and difficult to simulate. Recently, machine learning-based methods to predict the mechanical behavior of heterogeneous materials have made it possible to more thoroughly explore the massive input parameter space associated with heterogeneous blocks of material. Specifically, we can train machine learning (ML) models to closely approximate computationally expensive heterogeneous material simulations where the ML model is trained on a dataset of simulations that capture the range of spatial heterogeneity present in the material of interest. However, when it comes to applying these techniques to biological tissue more broadly, there is a major limitation: the relevant microstructural patterns are both challenging to obtain and difficult to analyze. Consequently, the number of useful examples available to characterize the input domain under study is limited. In this work, we investigate the efficacy of ML-based generative models as a tool for augmenting limited input pattern datasets. We find that a Style-based Generative Adversarial Network with an adaptive discriminator augmentation mechanism is able to successfully leverage just 1,000 example patterns to create meaningful generated patterns that can be used as inputs to finite element simulations to augment the training dataset. To enable this methodological contribution, we have created an open access dataset of Finite Element Analysis simulations based on Cahn-Hilliard patterns. We anticipate that future researchers will be able to leverage this dataset and build on the work presented here.  ( 2 min )
    Unsupervised Domain Adaptation with Contrastive Learning for OCT Segmentation. (arXiv:2203.03664v1 [cs.CV])
    Accurate segmentation of retinal fluids in 3D Optical Coherence Tomography images is key for diagnosis and personalized treatment of eye diseases. While deep learning has been successful at this task, trained supervised models often fail for images that do not resemble labeled examples, e.g. for images acquired using different devices. We hereby propose a novel semi-supervised learning framework for segmentation of volumetric images from new unlabeled domains. We jointly use supervised and contrastive learning, also introducing a contrastive pairing scheme that leverages similarity between nearby slices in 3D. In addition, we propose channel-wise aggregation as an alternative to conventional spatial-pooling aggregation for contrastive feature map projection. We evaluate our methods for domain adaptation from a (labeled) source domain to an (unlabeled) target domain, each containing images acquired with different acquisition devices. In the target domain, our method achieves a Dice coefficient 13.8% higher than SimCLR (a state-of-the-art contrastive framework), and leads to results comparable to an upper bound with supervised training in that domain. In the source domain, our model also improves the results by 5.4% Dice, by successfully leveraging information from many unlabeled images.  ( 2 min )
    Mammograms Classification: A Review. (arXiv:2203.03618v1 [eess.IV])
    An advanced reliable low-cost form of screening method, Digital mammography has been used as an effective imaging method for breast cancer detection. With an increased focus on technologies to aid healthcare, Mammogram images have been utilized in developing computer-aided diagnosis systems that will potentially help in clinical diagnosis. Researchers have proved that artificial intelligence with its emerging technologies can be used in the early detection of the disease and improve radiologists' performance in assessing breast cancer. In this paper, we review the methods developed for mammogram mass classification in two categories. The first one is classifying manually provided cropped region of interests (ROI) as either malignant or benign, and the second one is the classification of automatically segmented ROIs as either malignant or benign. We also provide an overview of datasets and evaluation metrics used in the classification task. Finally, we compare and discuss the deep learning approach to classical image processing and learning approach in this domain.  ( 2 min )
    Clustering and classification of low-dimensional data in explicit feature map domain: intraoperative pixel-wise diagnosis of adenocarcinoma of a colon in a liver. (arXiv:2203.03636v1 [eess.IV])
    Application of artificial intelligence in medicine brings in highly accurate predictions achieved by complex models, the reasoning of which is hard to interpret. Their generalization ability can be reduced because of the lack of pixel wise annotated images that occurs in frozen section tissue analysis. To partially overcome this gap, this paper explores the approximate explicit feature map (aEFM) transform of low-dimensional data into a low-dimensional subspace in Hilbert space. There, with a modest increase in computational complexity, linear algorithms yield improved performance and keep interpretability. They remain amenable to incremental learning that is not a trivial issue for some nonlinear algorithms. We demonstrate proposed methodology on a very large-scale problem related to intraoperative pixel-wise semantic segmentation and clustering of adenocarcinoma of a colon in a liver. Compared to the results in the input space, logistic classifier achieved statistically significant performance improvements in micro balanced accuracy and F1 score in the amounts of 12.04% and 12.58%, respectively. Support vector machine classifier yielded the increase of 8.04% and 9.41%. For clustering, increases of 0.79% and 0.85% are obtained with ultra large-scale spectral clustering algorithm. Results are supported by a discussion of interpretability using Shapely additive explanation values for predictions of linear classifier in input space and aEFM induced space.  ( 2 min )
    A Typology to Explore and Guide Explanatory Interactive Machine Learning. (arXiv:2203.03668v1 [cs.LG])
    Recently, more and more eXplanatory Interactive machine Learning (XIL) methods have been proposed with the goal of extending a model's learning process by integrating human user supervision on the model's explanations. These methods were often developed independently, provide different motivations and stem from different applications. Notably, up to now, there has not been a comprehensive evaluation of these works. By identifying a common set of basic modules and providing a thorough discussion of these modules, our work, for the first time, comes up with a unification of the various methods into a single typology. This typology can thus be used to categorize existing and future XIL methods based on the identified modules. Moreover, our work contributes by surveying six existing XIL methods. In addition to benchmarking these methods on their overall ability to revise a model, we perform additional benchmarks regarding wrong reason revision, interaction efficiency, robustness to feedback quality, and the ability to revise a strongly corrupted model. Apart from introducing these novel benchmarking tasks, for improved quantitative evaluations, we further introduce a novel Wrong Reason (\wrnospace) metric which measures the average wrong reason activation in a model's explanations to complement a qualitative inspection. In our evaluations, all methods prove to revise a model successfully. However, we found significant differences between the methods on individual benchmark tasks, revealing valuable application-relevant aspects not only for comparing current methods but also to motivate the necessity of incorporating these benchmarks in the development of future XIL methods.  ( 2 min )
  • Open

    Constrained Learning with Non-Convex Losses. (arXiv:2103.05134v3 [cs.LG] UPDATED)
    Though learning has become a core component of modern information processing, there is now ample evidence that it can lead to biased, unsafe, and prejudiced systems. The need to impose requirements on learning is therefore paramount, especially as it reaches critical applications in social, industrial, and medical domains. However, the non-convexity of most modern statistical problems is only exacerbated by the introduction of constraints. Whereas good unconstrained solutions can often be learned using empirical risk minimization, even obtaining a model that satisfies statistical constraints can be challenging. All the more so, a good one. In this paper, we overcome this issue by learning in the empirical dual domain, where constrained statistical learning problems become unconstrained and deterministic. We analyze the generalization properties of this approach by bounding the empirical duality gap -- i.e., the difference between our approximate, tractable solution and the solution of the original (non-convex) statistical problem -- and provide a practical constrained learning algorithm. These results establish a constrained counterpart to classical learning theory, enabling the explicit use of constraints in learning. We illustrate this theory and algorithm in rate-constrained learning applications arising in fairness and adversarial robustness.  ( 2 min )
    Neural Contextual Bandits via Reward-Biased Maximum Likelihood Estimation. (arXiv:2203.04192v1 [cs.LG])
    Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs. This paper studies the stochastic contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration. NeuralRBMLE leverages the representation power of neural networks and directly encodes exploratory behavior in the parameter space, without constructing confidence intervals of the estimated rewards. We propose two variants of NeuralRBMLE algorithms: The first variant directly obtains the RBMLE estimator by gradient ascent, and the second variant simplifies RBMLE to a simple index policy through an approximation. We show that both algorithms achieve $\widetilde{\mathcal{O}}(\sqrt{T})$ regret. Through extensive experiments, we demonstrate that the NeuralRBMLE algorithms achieve comparable or better empirical regrets than the state-of-the-art methods on real-world datasets with non-linear reward functions.  ( 2 min )
    Any Variational Autoencoder Can Do Arbitrary Conditioning. (arXiv:2201.12414v2 [cs.LG] UPDATED)
    Arbitrary conditioning is an important problem in unsupervised learning, where we seek to model the conditional densities $p(\mathbf{x}_u \mid \mathbf{x}_o)$ that underly some data, for all possible non-intersecting subsets $o, u \subset \{1, \dots , d\}$. However, the vast majority of density estimation only focuses on modeling the joint distribution $p(\mathbf{x})$, in which important conditional dependencies between features are opaque. We propose a simple and general framework, coined Posterior Matching, that enables any Variational Autoencoder (VAE) to perform arbitrary conditioning, without modification to the VAE itself. Posterior Matching applies to the numerous existing VAE-based approaches to joint density estimation, thereby circumventing the specialized models required by previous approaches to arbitrary conditioning. We find that Posterior Matching achieves performance that is comparable or superior to current state-of-the-art methods for a variety of tasks.  ( 2 min )
    Bayesian community detection for networks with covariates. (arXiv:2203.02090v1 [stat.ME] CROSS LISTED)
    The increasing prevalence of network data in a vast variety of fields and the need to extract useful information out of them have spurred fast developments in related models and algorithms. Among the various learning tasks with network data, community detection, the discovery of node clusters or "communities," has arguably received the most attention in the scientific community. In many real-world applications, the network data often come with additional information in the form of node or edge covariates that should ideally be leveraged for inference. In this paper, we add to a limited literature on community detection for networks with covariates by proposing a Bayesian stochastic block model with a covariate-dependent random partition prior. Under our prior, the covariates are explicitly expressed in specifying the prior distribution on the cluster membership. Our model has the flexibility of modeling uncertainties of all the parameter estimates including the community membership. Importantly, and unlike the majority of existing methods, our model has the ability to learn the number of the communities via posterior inference without having to assume it to be known. Our model can be applied to community detection in both dense and sparse networks, with both categorical and continuous covariates, and our MCMC algorithm is very efficient with good mixing properties. We demonstrate the superior performance of our model over existing models in a comprehensive simulation study and an application to two real datasets.  ( 2 min )
    A Review of the Gumbel-max Trick and its Extensions for Discrete Stochasticity in Machine Learning. (arXiv:2110.01515v2 [cs.LG] UPDATED)
    The Gumbel-max trick is a method to draw a sample from a categorical distribution, given by its unnormalized (log-)probabilities. Over the past years, the machine learning community has proposed several extensions of this trick to facilitate, e.g., drawing multiple samples, sampling from structured domains, or gradient estimation for error backpropagation in neural network optimization. The goal of this survey article is to present background about the Gumbel-max trick, and to provide a structured overview of its extensions to ease algorithm selection. Moreover, it presents a comprehensive outline of (machine learning) literature in which Gumbel-based algorithms have been leveraged, reviews commonly-made design choices, and sketches a future perspective.  ( 2 min )
    PERCEPT: a new online change-point detection method using topological data analysis. (arXiv:2203.04246v1 [stat.ME])
    Topological data analysis (TDA) provides a set of data analysis tools for extracting embedded topological structures from complex high-dimensional datasets. In recent years, TDA has been a rapidly growing field which has found success in a wide range of applications, including signal processing, neuroscience and network analysis. In these applications, the online detection of changes is of crucial importance, but this can be highly challenging since such changes often occur in a low-dimensional embedding within high-dimensional data streams. We thus propose a new method, called PERsistence diagram-based ChangE-PoinT detection (PERCEPT), which leverages the learned topological structure from TDA to sequentially detect changes. PERCEPT follows two key steps: it first learns the embedded topology as a point cloud via persistence diagrams, then applies a non-parametric monitoring approach for detecting changes in the resulting point cloud distributions. This yields a non-parametric, topology-aware framework which can efficiently detect online changes from high-dimensional data streams. We investigate the effectiveness of PERCEPT over existing methods in a suite of numerical experiments where the data streams have an embedded topological structure. We then demonstrate the usefulness of PERCEPT in two applications in solar flare monitoring and human gesture detection.  ( 2 min )
    Universal Graph Transformer Self-Attention Networks. (arXiv:1909.11855v13 [cs.LG] UPDATED)
    We introduce a transformer-based GNN model, named UGformer, to learn graph representations. In particular, we present two UGformer variants, wherein the first variant (publicized in September 2019) is to leverage the transformer on a set of sampled neighbors for each input node, while the second (publicized in May 2021) is to leverage the transformer on all input nodes. Experimental results demonstrate that the first UGformer variant achieves state-of-the-art accuracies on benchmark datasets for graph classification in both inductive setting and unsupervised transductive setting; and the second UGformer variant obtains state-of-the-art accuracies for inductive text classification. The code is available at: \url{https://github.com/daiquocnguyen/Graph-Transformer}.  ( 2 min )
    AgraSSt: Approximate Graph Stein Statistics for Interpretable Assessment of Implicit Graph Generators. (arXiv:2203.03673v1 [stat.ML])
    We propose and analyse a novel statistical procedure, coined AgraSSt, to assess the quality of graph generators that may not be available in explicit form. In particular, AgraSSt can be used to determine whether a learnt graph generating process is capable of generating graphs that resemble a given input graph. Inspired by Stein operators for random graphs, the key idea of AgraSSt is the construction of a kernel discrepancy based on an operator obtained from the graph generator. AgraSSt can provide interpretable criticisms for a graph generator training procedure and help identify reliable sample batches for downstream tasks. Using Stein`s method we give theoretical guarantees for a broad class of random graph models. We provide empirical results on both synthetic input graphs with known graph generation procedures, and real-world input graphs that the state-of-the-art (deep) generative models for graphs are trained on.  ( 2 min )
    The Fundamental Price of Secure Aggregation in Differentially Private Federated Learning. (arXiv:2203.03761v1 [cs.LG])
    We consider the problem of training a $d$ dimensional model with distributed differential privacy (DP) where secure aggregation (SecAgg) is used to ensure that the server only sees the noisy sum of $n$ model updates in every training round. Taking into account the constraints imposed by SecAgg, we characterize the fundamental communication cost required to obtain the best accuracy achievable under $\varepsilon$ central DP (i.e. under a fully trusted server and no communication constraints). Our results show that $\tilde{O}\left( \min(n^2\varepsilon^2, d) \right)$ bits per client are both sufficient and necessary, and this fundamental limit can be achieved by a linear scheme based on sparse random projections. This provides a significant improvement relative to state-of-the-art SecAgg distributed DP schemes which use $\tilde{O}(d\log(d/\varepsilon^2))$ bits per client. Empirically, we evaluate our proposed scheme on real-world federated learning tasks. We find that our theoretical analysis is well matched in practice. In particular, we show that we can reduce the communication cost significantly to under $1.2$ bits per parameter in realistic privacy settings without decreasing test-time performance. Our work hence theoretically and empirically specifies the fundamental price of using SecAgg.  ( 2 min )
    Online Weak-form Sparse Identification of Partial Differential Equations. (arXiv:2203.03979v1 [math.OC])
    This paper presents an online algorithm for identification of partial differential equations (PDEs) based on the weak-form sparse identification of nonlinear dynamics algorithm (WSINDy). The algorithm is online in a sense that if performs the identification task by processing solution snapshots that arrive sequentially. The core of the method combines a weak-form discretization of candidate PDEs with an online proximal gradient descent approach to the sparse regression problem. In particular, we do not regularize the $\ell_0$-pseudo-norm, instead finding that directly applying its proximal operator (which corresponds to a hard thresholding) leads to efficient online system identification from noisy data. We demonstrate the success of the method on the Kuramoto-Sivashinsky equation, the nonlinear wave equation with time-varying wavespeed, and the linear wave equation, in one, two, and three spatial dimensions, respectively. In particular, our examples show that the method is capable of identifying and tracking systems with coefficients that vary abruptly in time, and offers a streaming alternative to problems in higher dimensions.  ( 2 min )
    SDP Achieves Exact Minimax Optimality in Phase Synchronization. (arXiv:2101.02347v3 [math.ST] UPDATED)
    We study the phase synchronization problem with noisy measurements $Y=z^*z^{*H}+\sigma W\in\mathbb{C}^{n\times n}$, where $z^*$ is an $n$-dimensional complex unit-modulus vector and $W$ is a complex-valued Gaussian random matrix. It is assumed that each entry $Y_{jk}$ is observed with probability $p$. We prove that an SDP relaxation of the MLE achieves the error bound $(1+o(1))\frac{\sigma^2}{2np}$ under a normalized squared $\ell_2$ loss. This result matches the minimax lower bound of the problem, and even the leading constant is sharp. The analysis of the SDP is based on an equivalent non-convex programming whose solution can be characterized as a fixed point of the generalized power iteration lifted to a higher dimensional space. This viewpoint unifies the proofs of the statistical optimality of three different methods: MLE, SDP, and generalized power method. The technique is also applied to the analysis of the SDP for $\mathbb{Z}_2$ synchronization, and we achieve the minimax optimal error $\exp\left(-(1-o(1))\frac{np}{2\sigma^2}\right)$ with a sharp constant in the exponent.  ( 2 min )
    Variational Nearest Neighbor Gaussian Processes. (arXiv:2202.01694v2 [cs.LG] UPDATED)
    Variational approximations to Gaussian processes (GPs) typically use a small set of inducing points to form a low-rank approximation to the covariance matrix. In this work, we instead exploit a sparse approximation of the precision matrix. We propose variational nearest neighbor Gaussian process (VNNGP), which introduces a prior that only retains correlations within K nearest-neighboring observations, thereby inducing sparse precision structure. Using the variational framework, VNNGP's objective can be factorized over both observations and inducing points, enabling stochastic optimization with a time complexity of O($K^3$). Hence, we can arbitrarily scale the inducing point size, even to the point of putting inducing points at every observed location. We compare VNNGP to other scalable GPs through various experiments, and demonstrate that VNNGP (1) can dramatically outperform low-rank methods, and (2) is less prone to overfitting than other nearest neighbor methods.  ( 2 min )
    A Sharp Characterization of Linear Estimators for Offline Policy Evaluation. (arXiv:2203.04236v1 [cs.LG])
    Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a potentially different policy. In order to tackle problems with complex, high-dimensional observations, there has been significant interest from theoreticians and practitioners alike in understanding the possibility of function approximation in reinforcement learning. Despite significant study, a sharp characterization of when we might expect offline policy evaluation to be tractable, even in the simplest setting of linear function approximation, has so far remained elusive, with a surprising number of strong negative results recently appearing in the literature. In this work, we identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods, in particular Fitted Q-iteration (FQI) and least squares temporal difference learning (LSTD), to succeed at offline policy evaluation. Using this characterization, we establish a precise hierarchy of regimes under which these estimators succeed. We prove that LSTD works under strictly weaker conditions than FQI. Furthermore, we establish that if a problem is not solvable via LSTD, then it cannot be solved by a broad class of linear estimators, even in the limit of infinite data. Taken together, our results provide a complete picture of the behavior of linear estimators for offline policy evaluation (OPE), unify previously disparate analyses of canonical algorithms, and provide significantly sharper notions of the underlying statistical complexity of OPE.  ( 2 min )
    Optimum-statistical Collaboration Towards General and Efficient Black-box Optimization. (arXiv:2106.09215v3 [stat.ML] UPDATED)
    In this paper, we make the key delineation on the roles of resolution and statistical uncertainty in hierarchical bandits-based black-box optimization algorithms, guiding a more general analysis and a more efficient algorithm design. We introduce the \textit{optimum-statistical collaboration}, an algorithm framework of managing the interaction between optimization error flux and statistical error flux evolving in the optimization process. We provide a general analysis of this framework without specifying the forms of statistical error and uncertainty quantifier. Our framework and its analysis, due to their generality, can be applied to a large family of functions and partitions that satisfy different local smoothness assumptions and have different numbers of local optimums, which is much richer than the class of functions studied in prior works. Our framework also inspires us to propose a better measure of the statistical uncertainty and consequently a variance-adaptive algorithm \texttt{VHCT}. In theory, we prove the algorithm enjoys rate-optimal regret bounds under different local smoothness assumptions; in experiments, we show the algorithm outperforms prior efforts in different settings.  ( 2 min )
    Tight bounds for minimum l1-norm interpolation of noisy data. (arXiv:2111.05987v2 [math.ST] UPDATED)
    We provide matching upper and lower bounds of order $\sigma^2/\log(d/n)$ for the prediction error of the minimum $\ell_1$-norm interpolator, a.k.a. basis pursuit. Our result is tight up to negligible terms when $d \gg n$, and is the first to imply asymptotic consistency of noisy minimum-norm interpolation for isotropic features and sparse ground truths. Our work complements the literature on "benign overfitting" for minimum $\ell_2$-norm interpolation, where asymptotic consistency can be achieved only when the features are effectively low-dimensional.  ( 2 min )
    Outlier detection in multivariate functional data through a contaminated mixture model. (arXiv:2106.07222v2 [math.ST] UPDATED)
    In an industrial context, the activity of sensors is recorded at a high frequency. A challenge is to automatically detect abnormal measurement behavior. Considering the sensor measures as functional data, the problem can be formulated as the detection of outliers in a multivariate functional data set. Due to the heterogeneity of this data set, the proposed contaminated mixture model both clusters the multivariate functional data into homogeneous groups and detects outliers. The main advantage of this procedure over its competitors is that it does not require to specify the proportion of outliers. Model inference is performed through an Expectation-Conditional Maximization algorithm, and the BIC is used to select the number of clusters. Numerical experiments on simulated data demonstrate the high performance achieved by the inference algorithm. In particular, the proposed model outperforms the competitors. Its application on the real data which motivated this study allows to correctly detect abnormal behaviors.  ( 2 min )
    Variational methods for simulation-based inference. (arXiv:2203.04176v1 [stat.ML])
    We present Sequential Neural Variational Inference (SNVI), an approach to perform Bayesian inference in models with intractable likelihoods. SNVI combines likelihood-estimation (or likelihood-ratio-estimation) with variational inference to achieve a scalable simulation-based inference approach. SNVI maintains the flexibility of likelihood(-ratio) estimation to allow arbitrary proposals for simulations, while simultaneously providing a functional estimate of the posterior distribution without requiring MCMC sampling. We present several variants of SNVI and demonstrate that they are substantially more computationally efficient than previous algorithms, without loss of accuracy on benchmark tasks. We apply SNVI to a neuroscience model of the pyloric network in the crab and demonstrate that it can infer the posterior distribution with one order of magnitude fewer simulations than previously reported. SNVI vastly reduces the computational cost of simulation-based inference while maintaining accuracy and flexibility, making it possible to tackle problems that were previously inaccessible.  ( 2 min )
    Estimating the average causal effect of intervention in continuous variables using machine learning. (arXiv:2203.03916v1 [stat.ML])
    The most widely discussed methods for estimating the Average Causal Effect / Average Treatment Effect are those for intervention in discrete binary variables whose value represents the intervention / non-intervention groups. On the other hand, methods for intervening in continuous variables independent of the data generating model has not been developed. In this study, we give a method for estimating the average causal effect for intervention in continuous variables that can be applied to data of any generating model as long as the causal effect is identifiable. The proposing method is independent of machine learning algorithms and preserves the identifiability of the data.  ( 2 min )
    Does the Data Induce Capacity Control in Deep Learning?. (arXiv:2110.14163v2 [cs.LG] UPDATED)
    We show that the input correlation matrix of typical classification datasets has an eigenspectrum where, after a sharp initial drop, a large number of small eigenvalues are distributed uniformly over an exponentially large range. This structure is mirrored in a network trained on this data: we show that the Hessian and the Fisher Information Matrix (FIM) have eigenvalues that are spread uniformly over exponentially large ranges. We call such eigenspectra "sloppy" because sets of weights corresponding to small eigenvalues can be changed by large magnitudes without affecting the loss. Networks trained on atypical datasets with non-sloppy inputs do not share these traits and deep networks trained on such datasets generalize poorly. Inspired by this, we study the hypothesis that sloppiness of inputs aids generalization in deep networks. We show that if the Hessian is sloppy, we can compute non-vacuous PAC-Bayes generalization bounds analytically. By exploiting our empirical observation that training predominantly takes place in the non-sloppy subspace of the FIM, we develop data-distribution dependent PAC-Bayes priors that lead to accurate generalization bounds using numerical optimization.  ( 2 min )
    Distributed Learning of Deep Neural Networks using Independent Subnet Training. (arXiv:1910.02120v6 [cs.LG] UPDATED)
    Distributed machine learning (ML) can bring more computational resources to bear than single-machine learning, thus enabling reductions in training time. Distributed learning partitions models and data over many machines, allowing model and dataset sizes beyond the available compute power and memory of a single machine. In practice though, distributed ML is challenging when distribution is mandatory, rather than chosen by the practitioner. In such scenarios, data could unavoidably be separated among workers due to limited memory capacity per worker or even because of data privacy issues. There, existing distributed methods will utterly fail due to dominant transfer costs across workers, or do not even apply. We propose a new approach to distributed fully connected neural network learning, called independent subnet training (IST), to handle these cases. In IST, the original network is decomposed into a set of narrow subnetworks with the same depth. These subnetworks are then trained locally before parameters are exchanged to produce new subnets and the training cycle repeats. Such a naturally "model parallel" approach limits memory usage by storing only a portion of network parameters on each device. Additionally, no requirements exist for sharing data between workers (i.e., subnet training is local and independent) and communication volume and frequency are reduced by decomposing the original network into independent subnets. These properties of IST can cope with issues due to distributed data, slow interconnects, or limited device memory, making IST a suitable approach for cases of mandatory distribution. We show experimentally that IST results in training times that are much lower than common distributed learning approaches.  ( 3 min )
    A Fast Scale-Invariant Algorithm for Non-negative Least Squares with Non-negative Data. (arXiv:2203.03808v1 [math.OC])
    Nonnegative (linear) least square problems are a fundamental class of problems that is well-studied in statistical learning and for which solvers have been implemented in many of the standard programming languages used within the machine learning community. The existing off-the-shelf solvers view the non-negativity constraint in these problems as an obstacle and, compared to unconstrained least squares, perform additional effort to address it. However, in many of the typical applications, the data itself is nonnegative as well, and we show that the nonnegativity in this case makes the problem easier. In particular, while the oracle complexity of unconstrained least squares problems necessarily scales with one of the data matrix constants (typically the spectral norm) and these problems are solved to additive error, we show that nonnegative least squares problems with nonnegative data are solvable to multiplicative error and with complexity that is independent of any matrix constants. The algorithm we introduce is accelerated and based on a primal-dual perspective. We further show how to provably obtain linear convergence using adaptive restart coupled with our method and demonstrate its effectiveness on large-scale data via numerical experiments.  ( 2 min )
    A general class of surrogate functions for stable and efficient reinforcement learning. (arXiv:2108.05828v4 [cs.LG] UPDATED)
    Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.  ( 2 min )
    RFpredInterval: An R Package for Prediction Intervals with Random Forests and Boosted Forests. (arXiv:2106.08217v2 [stat.ML] UPDATED)
    Like many predictive models, random forests provide point predictions for new observations. Besides the point prediction, it is important to quantify the uncertainty in the prediction. Prediction intervals provide information about the reliability of the point predictions. We have developed a comprehensive R package, RFpredInterval, that integrates 16 methods to build prediction intervals with random forests and boosted forests. The set of methods implemented in the package includes a new method to build prediction intervals with boosted forests (PIBF) and 15 method variations to produce prediction intervals with random forests, as proposed by Roy and Larocque (2020). We perform an extensive simulation study and apply real data analyses to compare the performance of the proposed method to ten existing methods for building prediction intervals with random forests. The results show that the proposed method is very competitive and, globally, outperforms competing methods.  ( 2 min )
    Nonconvex Matrix Completion with Linearly Parameterized Factors. (arXiv:2003.13153v2 [math.ST] UPDATED)
    Techniques of matrix completion aim to impute a large portion of missing entries in a data matrix through a small portion of observed ones. In practice including collaborative filtering, prior information and special structures are usually employed in order to improve the accuracy of matrix completion. In this paper, we propose a unified nonconvex optimization framework for matrix completion with linearly parameterized factors. In particular, by introducing a condition referred to as Correlated Parametric Factorization, we can conduct a unified geometric analysis for the nonconvex objective by establishing uniform upper bounds for low-rank estimation resulting from any local minimum. Perhaps surprisingly, the condition of Correlated Parametric Factorization holds for important examples including subspace-constrained matrix completion and skew-symmetric matrix completion. The effectiveness of our unified nonconvex optimization method is also empirically illustrated by extensive numerical simulations.  ( 2 min )
    Flat minima generalize for low-rank matrix recovery. (arXiv:2203.03756v1 [cs.LG])
    Empirical evidence suggests that for a variety of overparameterized nonlinear models, most notably in neural network training, the growth of the loss around a minimizer strongly impacts its performance. Flat minima -- those around which the loss grows slowly -- appear to generalize well. This work takes a step towards understanding this phenomenon by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. We analyze overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single hidden layer neural networks with quadratic activation functions. In all cases, we show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. For matrix completion, we establish weak recovery, although empirical evidence suggests exact recovery holds here as well. We complete the paper with synthetic experiments that illustrate our findings.  ( 2 min )
    Convergence Rates for Gaussian Mixtures of Experts. (arXiv:1907.04377v2 [math.ST] UPDATED)
    We provide a theoretical treatment of over-specified Gaussian mixtures of experts with covariate-free gating networks. We establish the convergence rates of the maximum likelihood estimation (MLE) for these models. Our proof technique is based on a novel notion of \emph{algebraic independence} of the expert functions. Drawing on optimal transport theory, we establish a connection between the algebraic independence and a certain class of partial differential equations (PDEs). Exploiting this connection allows us to derive convergence rates and minimax lower bounds for parameter estimation.  ( 2 min )
    Contrastive Conditional Neural Processes. (arXiv:2203.03978v1 [cs.LG])
    Conditional Neural Processes~(CNPs) bridge neural networks with probabilistic inference to approximate functions of Stochastic Processes under meta-learning settings. Given a batch of non-{\it i.i.d} function instantiations, CNPs are jointly optimized for in-instantiation observation prediction and cross-instantiation meta-representation adaptation within a generative reconstruction pipeline. There can be a challenge in tying together such two targets when the distribution of function observations scales to high-dimensional and noisy spaces. Instead, noise contrastive estimation might be able to provide more robust representations by learning distributional matching objectives to combat such inherent limitation of generative models. In light of this, we propose to equip CNPs by 1) aligning prediction with encoded ground-truth observation, and 2) decoupling meta-representation adaptation from generative reconstruction. Specifically, two auxiliary contrastive branches are set up hierarchically, namely in-instantiation temporal contrastive learning~({\tt TCL}) and cross-instantiation function contrastive learning~({\tt FCL}), to facilitate local predictive alignment and global function consistency, respectively. We empirically show that {\tt TCL} captures high-level abstraction of observations, whereas {\tt FCL} helps identify underlying functions, which in turn provides more efficient representations. Our model outperforms other CNPs variants when evaluating function distribution reconstruction and parameter identification across 1D, 2D and high-dimensional time-series.  ( 2 min )
    Strength of Minibatch Noise in SGD. (arXiv:2102.05375v3 [cs.LG] UPDATED)
    The noise in stochastic gradient descent (SGD), caused by minibatch sampling, is poorly understood despite its practical importance in deep learning. This work presents the first systematic study of the SGD noise and fluctuations close to a local minimum. We first analyze the SGD noise in linear regression in detail and then derive a general formula for approximating SGD noise in different types of minima. For application, our results (1) provide insight into the stability of training a neural network, (2) suggest that a large learning rate can help generalization by introducing an implicit regularization, (3) explain why the linear learning rate-batchsize scaling law fails at a large learning rate or at a small batchsize and (4) can provide an understanding of how discrete-time nature of SGD affects the recently discovered power-law phenomenon of SGD.  ( 2 min )
    Valid prediction intervals for regression problems. (arXiv:2107.00363v3 [stat.ML] UPDATED)
    Over the last few decades, various methods have been proposed for estimating prediction intervals in regression settings, including Bayesian methods, ensemble methods, direct interval estimation methods and conformal prediction methods. An important issue is the calibration of these methods: the generated prediction intervals should have a predefined coverage level, without being overly conservative. In this work, we review the above four classes of methods from a conceptual and experimental point of view. Results on benchmark data sets from various domains highlight large fluctuations in performance from one data set to another. These observations can be attributed to the violation of certain assumptions that are inherent to some classes of methods. We illustrate how conformal prediction can be used as a general calibration procedure for methods that deliver poor results without a calibration step.  ( 2 min )
    ADAVI: Automatic Dual Amortized Variational Inference Applied To Pyramidal Bayesian Models. (arXiv:2106.12248v3 [cs.LG] UPDATED)
    Frequently, population studies feature pyramidally-organized data represented using Hierarchical Bayesian Models (HBM) enriched with plates. These models can become prohibitively large in settings such as neuroimaging, where a sample is composed of a functional MRI signal measured on 300 brain locations, across 4 measurement sessions, and 30 subjects, resulting in around 1 million latent parameters.Such high dimensionality hampers the usage of modern, expressive flow-based techniques.To infer parameter posterior distributions in this challenging class of problems, we designed a novel methodology that automatically produces a variational family dual to a target HBM. This variational family, represented as a neural network, consists in the combination of an attention-based hierarchical encoder feeding summary statistics to a set of normalizing flows. Our automatically-derived neural network exploits exchangeability in the plate-enriched HBM and factorizes its parameter space. The resulting architecture reduces by orders of magnitude its parameterization with respect to that of a typical flow-based representation, while maintaining expressivity.Our method performs inference on the specified HBM in an amortized setup: once trained, it can readily be applied to a new data sample to compute the parameters' full posterior.We demonstrate the capability and scalability of our method on simulated data, as well as a challenging high-dimensional brain parcellation experiment. We also open up several questions that lie at the intersection between normalizing flows, SBI, structured Variational Inference, and inference amortization.  ( 2 min )
    The Terminating-Knockoff Filter: Fast High-Dimensional Variable Selection with False Discovery Rate Control. (arXiv:2110.06048v4 [stat.ME] UPDATED)
    We propose the Terminating-Knockoff (T-Knock) filter, a fast variable selection method for high-dimensional data. The T-Knock filter controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables. This is achieved by fusing the solutions of multiple early terminated random experiments. The experiments are conducted on a combination of the original predictors and multiple sets of randomly generated knockoff predictors. A finite sample proof based on martingale theory for the FDR control property is provided. Numerical simulations show that the FDR is controlled at the target level while allowing for a high power. We prove under mild conditions that the knockoffs can be sampled from any univariate probability distribution with existing finite expectation and variance. The computational complexity of the proposed method is derived and it is demonstrated via numerical simulations that the sequential computation time is multiple orders of magnitude lower than that of the strongest benchmark methods in sparse high-dimensional settings. The T-Knock filter outperforms state-of-the-art methods for FDR control on a simulated genome-wide association study (GWAS), while its computation time is more than two orders of magnitude lower than that of the strongest benchmark methods. An open source R package containing the implementation of the T-Knock filter is available at https://github.com/jasinmachkour/tknock.  ( 2 min )
    Semi-Random Sparse Recovery in Nearly-Linear Time. (arXiv:2203.04002v1 [cs.DS])
    Sparse recovery is one of the most fundamental and well-studied inverse problems. Standard statistical formulations of the problem are provably solved by general convex programming techniques and more practical, fast (nearly-linear time) iterative methods. However, these latter "fast algorithms" have previously been observed to be brittle in various real-world settings. We investigate the brittleness of fast sparse recovery algorithms to generative model changes through the lens of studying their robustness to a "helpful" semi-random adversary, a framework which tests whether an algorithm overfits to input assumptions. We consider the following basic model: let $\mathbf{A} \in \mathbb{R}^{n \times d}$ be a measurement matrix which contains an unknown subset of rows $\mathbf{G} \in \mathbb{R}^{m \times d}$ which are bounded and satisfy the restricted isometry property (RIP), but is otherwise arbitrary. Letting $x^\star \in \mathbb{R}^d$ be $s$-sparse, and given either exact measurements $b = \mathbf{A} x^\star$ or noisy measurements $b = \mathbf{A} x^\star + \xi$, we design algorithms recovering $x^\star$ information-theoretically optimally in nearly-linear time. We extend our algorithm to hold for weaker generative models relaxing our planted RIP assumption to a natural weighted variant, and show that our method's guarantees naturally interpolate the quality of the measurement matrix to, in some parameter regimes, run in sublinear time. Our approach differs from prior fast iterative methods with provable guarantees under semi-random generative models: natural conditions on a submatrix which make sparse recovery tractable are NP-hard to verify. We design a new iterative method tailored to the geometry of sparse recovery which is provably robust to our semi-random model. We hope our approach opens the door to new robust, efficient algorithms for natural statistical inverse problems.  ( 2 min )
    Mixture of Conditional Gaussian Graphical Models for unlabelled heterogeneous populations in the presence of co-factors. (arXiv:2006.11094v4 [math.ST] UPDATED)
    Conditional correlation networks, within Gaussian Graphical Models (GGM), are widely used to describe the direct interactions between the components of a random vector. In the case of an unlabelled Heterogeneous population, Expectation Maximisation (EM) algorithms for Mixtures of GGM have been proposed to estimate both each sub-population's graph and the class labels. However, we argue that, with most real data, class affiliation cannot be described with a Mixture of Gaussian, which mostly groups data points according to their geometrical proximity. In particular, there often exists external co-features whose values affect the features' average value, scattering across the feature space data points belonging to the same sub-population. Additionally, if the co-features' effect on the features is Heterogeneous, then the estimation of this effect cannot be separated from the sub-population identification. In this article, we propose a Mixture of Conditional GGM (CGGM) that subtracts the heterogeneous effects of the co-features to regroup the data points into sub-population corresponding clusters. We develop a penalised EM algorithm to estimate graph-sparse model parameters. We demonstrate on synthetic and real data how this method fulfils its goal and succeeds in identifying the sub-populations where the Mixtures of GGM are disrupted by the effect of the co-features.  ( 2 min )
    Confidence intervals for the random forest generalization error. (arXiv:2112.06101v2 [stat.ML] UPDATED)
    We show that the byproducts of the standard training process of a random forest yield not only the well known and almost computationally free out-of-bag point estimate of the model generalization error, but also give a direct path to compute confidence intervals for the generalization error which avoids processes of data splitting and model retraining. Besides the low computational cost involved in their construction, these confidence intervals are shown through simulations to have good coverage and appropriate shrinking rate of their width in terms of the training sample size.  ( 2 min )
    On the intrinsic dimensionality of Covid-19 data: a global perspective. (arXiv:2203.04165v1 [stat.AP])
    This paper aims to develop a global perspective of the complexity of the relationship between the standardised per-capita growth rate of Covid-19 cases, deaths, and the OxCGRT Covid-19 Stringency Index, a measure describing a country's stringency of lockdown policies. To achieve our goal, we use a heterogeneous intrinsic dimension estimator implemented as a Bayesian mixture model, called Hidalgo. We identify that the Covid-19 dataset may project onto two low-dimensional manifolds without significant information loss. The low dimensionality suggests strong dependency among the standardised growth rates of cases and deaths per capita and the OxCGRT Covid-19 Stringency Index for a country over 2020-2021. Given the low dimensional structure, it may be feasible to model observable Covid-19 dynamics with few parameters. Importantly, we identify spatial autocorrelation in the intrinsic dimension distribution worldwide. Moreover, we highlight that high-income countries are more likely to lie on low-dimensional manifolds, likely arising from aging populations, comorbidities, and increased per capita mortality burden from Covid-19. Finally, we temporally stratify the dataset to examine the intrinsic dimension at a more granular level throughout the Covid-19 pandemic.  ( 2 min )
    Data adaptive RKHS Tikhonov regularization for learning kernels in operators. (arXiv:2203.03791v1 [stat.ML])
    We present DARTR: a Data Adaptive RKHS Tikhonov Regularization method for the linear inverse problem of nonparametric learning of function parameters in operators. A key ingredient is a system intrinsic data-adaptive (SIDA) RKHS, whose norm restricts the learning to take place in the function space of identifiability. DARTR utilizes this norm and selects the regularization parameter by the L-curve method. We illustrate its performance in examples including integral operators, nonlinear operators and nonlocal operators with discrete synthetic data. Numerical results show that DARTR leads to an accurate estimator robust to both numerical error due to discrete data and noise in data, and the estimator converges at a consistent rate as the data mesh refines under different levels of noises, outperforming two baseline regularizers using $l^2$ and $L^2$ norms.  ( 2 min )
    Information Theoretic Structured Generative Modeling. (arXiv:2110.05794v2 [cs.LG] UPDATED)
    R\'enyi's information provides a theoretical foundation for tractable and data-efficient non-parametric density estimation, based on pair-wise evaluations in a reproducing kernel Hilbert space (RKHS). This paper extends this framework to parametric probabilistic modeling, motivated by the fact that R\'enyi's information can be estimated in closed-form for Gaussian mixtures. Based on this special connection, a novel generative model framework called the structured generative model (SGM) is proposed that makes straightforward optimization possible, because costs are scale-invariant, avoiding high gradient variance while imposing less restrictions on absolute continuity, which is a huge advantage in parametric information theoretic optimization. The implementation employs a single neural network driven by an orthonormal input appended to a single white noise source adapted to learn an infinite Gaussian mixture model (IMoG), which provides an empirically tractable model distribution in low dimensions. To train SGM, we provide three novel variational cost functions, based on R\'enyi's second-order entropy and divergence, to implement minimization of cross-entropy, minimization of variational representations of $f$-divergence, and maximization of the evidence lower bound (conditional probability). We test the framework for estimation of mutual information and compare the results with the mutual information neural estimation (MINE), for density estimation, for conditional probability estimation in Markov models as well as for training adversarial networks. Our preliminary results show that SGM significantly improves MINE estimation in terms of data efficiency and variance, conventional and variational Gaussian mixture models, as well as the performance of generative adversarial networks.  ( 2 min )
    Noisy Low-rank Matrix Optimization: Geometry of Local Minima and Convergence Rate. (arXiv:2203.03899v1 [math.OC])
    This paper is concerned with low-rank matrix optimization, which has found a wide range of applications in machine learning. This problem in the special case of matrix sense has been studied extensively through the notion of Restricted Isometry Property (RIP), leading to a wealth of results on the geometric landscape of the problem and the convergence rate of common algorithms. However, the existing results are not able to handle the problem with a general objective function subject to noisy data. In this paper, we address this problem by developing a mathematical framework that can deal with random corruptions to general objective functions, where the noise model is arbitrary. We prove that as long as the RIP constant of the noiseless objective is less than $1/3$, any spurious local solution of the noisy optimization problem must be close to the ground truth solution. By working through the strict saddle property, we also show that an approximate solution can be found in polynomial time. We characterize the geometry of the spurious local minima of the problem in a local region around the ground truth in the case when the RIP constant is greater than $1/3$. This paper offers the first set of results on the global and local optimization landscapes of general low-rank optimization problems under arbitrary random corruptions.  ( 2 min )

  • Open

    The long game: Desiloed systems and feedback loops by design (I of II)
    Data management (DM) discussions can be frustrating because both those feeling the pain and the consultants who try to help them are–90+ percent of the time, it seems–still using the same old ways. Those ways only go so far, and won’t go any farther. That’s because those who reinforce the old ways assume that what… Read More »The long game: Desiloed systems and feedback loops by design (I of II) The post The long game: Desiloed systems and feedback loops by design (I of II) appeared first on Data Science Central.  ( 5 min )
    How the Internet of Things Empowers CAD
    Computer-Aided Design (CAD) has become an essential tool in a great variety of industries. Software solutions like AutoCAD are extensively used across the business spectrum. From AEC projects to product design — CAD software has revolutionized the way engineers and designers work.   Of course, as CAD and other modern technologies keep on evolving, this creates… Read More »How the Internet of Things Empowers CAD The post How the Internet of Things Empowers CAD appeared first on Data Science Central.  ( 5 min )
    Decisions Part 2: Creating a Culture of Analytics-driven Innovation
    In part 1 of this multi-part series on the value-realizing, collaborative power of decisions, we introduced the concept of an AI-driven decision factory. An AI-power decision factory leverages an AI-enabled and human-empowered culture to continuously explore, learn, adapt, unlearn, and re-learn at the customer and operational frontlines of the organization. An AI-powered decision factory facilitates… Read More »Decisions Part 2: Creating a Culture of Analytics-driven Innovation The post Decisions Part 2: Creating a Culture of Analytics-driven Innovation appeared first on Data Science Central.  ( 7 min )
    DSC Weekly Digest 08 March 2022: Beware of Wishful Thinking
    Projects fail. There are many reasons why they do, but a surprising number of them come down to one or more variations of the “Wishful Thinking” theme. From a data science standpoint, this is usually referred to as making faulty assumptions, but the idea is the same. And with very few exceptions, the assumptions being… Read More »DSC Weekly Digest 08 March 2022: Beware of Wishful Thinking The post DSC Weekly Digest 08 March 2022: Beware of Wishful Thinking appeared first on Data Science Central.  ( 5 min )
  • Open

    How ML Hybrid Parser Beats Traditional Parser
    submitted by /u/Kagermanov [link] [comments]
    DeepMind’s new AI model helps decipher, date, and locate ancient inscriptions
    submitted by /u/nick7566 [link] [comments]
    LinkedIn Open-Sources The Implementation of PASS (Performance-Adaptive Sampling Strategy) For Deep Learning
    Understanding the relationships between entity sets maintained in a database is critical. In this context, an entity is an object or a data component. Entity relationships are many times depicted using graphs in a variety of ways. For instance, professional graphs show how people collaborate, whereas social graphs show how people connect with one another. To better utilize the graphs, deep learning models called GNNs (Graph Neural Networks) are taught to interpret graphs. GNNs, for example, look at a member’s connections and connections of connections. They then use this neighborhood knowledge to do AI tasks like search and recommendation. However, GNNs have several limitations in terms of how they exploit a member’s neighbors. A GNN-based strategy, for starters, does not scale to real-world social networks. In many circumstances, a single member has many connections, and utilizing all of them is impractical. A celebrity, for example, could have hundreds of millions of connections. The fact that not every connection is relevant to the work at hand is a second barrier. For example, in the job recommendation task, the members connections who work in quite different industries and may be personal friends would be irrelevant to the task. Continue Reading My Summary On This Research Paper: https://dl.acm.org/doi/pdf/10.1145/3447548.3467284 Github: https://github.com/linkedin/PASS-GNN https://preview.redd.it/9woztlslhem81.png?width=772&format=png&auto=webp&s=a28ef90e8394e414ca0b1a7e2ab76a76c65aba95 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    How do I use the sigmoid derivative correctly?
    See here for info. submitted by /u/UnityPlum [link] [comments]
    Italy Fined Clearview AI €20 Million For Illegally Using its Citizens Data
    submitted by /u/harshsharma9619 [link] [comments]
    With AI, Meta Plans to Remove Language Barriers in the Metaverse (short audio clip from Mark Zuckerberg)
    submitted by /u/frog9913 [link] [comments]
    The internet as we know it is just a huge ad network that belongs to the states
    some concerns I wanted to share on Twitter today but got banned - then I got to post the fact I got banned on Twitter on Peepeth, and they wouldn't let me post my tx on chain because it violates their terms. I am now confident that reddit will flag this as well, but f you all U.S. hands. manifold put this in a nice perspective, saying that your protocol is your terms of service and not your fancy buzzwords on your landing page like nfts, web3 crypto best ico now buy win only! ~ so. they said you can't buy weapons with crypto cause they will know who you are, unless you're anonymous, or don't care. Eg. if you're a terrorist and buy a rocket launcher for BTC, they might know what you bought but can't really stop you. (unless you have a custodial wallet). what is literally happening at t…  ( 3 min )
    Low processing speed
    Hi everyone, I am trying to train my customized YOLO model on Google Colab using GPU, I'm using a very small dataset (200 images) and yet it's taking me so long (more than 10 hours of training) for only 2396 iterations, and my average loss is not getting better, any suggestions? ​ https://preview.redd.it/4y33zekx0dm81.png?width=1811&format=png&auto=webp&s=96a425c1b261c7426d129b48dac4d1f521131f8e submitted by /u/Lu3as33 [link] [comments]
    A Roadmap for Beginners in Machine Learning with many valuable resources for any ML workers or enthusiasts + how to stay up-to-date with news
    This guide is intended for anyone having zero or a small background in programming, maths, and machine learning. There is no specific order to follow, but a classic path would be from top to bottom. If you don't like reading books, skip it, if you don't want to follow an online course, you can skip it as well. There is not a single way to become a machine learning expert and with motivation, you can absolutely achieve it. The video: https://youtu.be/RirEw-uaS_8?list=PLO4GrDnQanVfb6Ins6up1xScJHl8YuwPQ The complete article: https://pub.towardsai.net/start-machine-learning-in-2020-become-an-expert-from-nothing-for-free-f31587630cf7 All the links on GitHub: https://github.com/louisfb01/start-machine-learning-in-2020 Artificial is a fantastic field, but it goes extremely fast. Don't miss out on the most important and exciting news by joining great communities, people, newsletters, and more you can all find in this guide! submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    North America extended its dominance for artificial intelligence (AI) hiring among ship industry companies in the three months ending January.. number of roles in NA made up 66.8 per cent of total AI jobs – up from 53 per cent in the same quarter in 2020.
    submitted by /u/dannylenwinn [link] [comments]  ( 1 min )
  • Open

    What is the most popular Neural Network architecture used in state-of-the-art model-based RL?
    What is the most popular NN model for model-based RL to predict the dynamics of the environment? (for continuous, high dimensional state spaces) submitted by /u/Blasphemer666 [link] [comments]  ( 1 min )
    Loved this paper on open problems in Cooperative AI
    submitted by /u/dwightschrute1905 [link] [comments]  ( 1 min )
    is there a defacto ranking of single-agent benchmarks by difficulty?
    basically the title. probably not able to answer comprehensibly since benchmarks vary in design and difficulty across a variety many factors (state space size, action space size, reward design, features, etc.) - but maybe broad generalizations can be made? something like "pendulum and mountain cart are the easiest, moonlander is somewhere in the middle, and atari is at the end"? submitted by /u/Farconion [link] [comments]  ( 1 min )
    Implementing Model-Based Policy Optimization (MBPO)
    Hi, I am interested in model-based reinforcement learning and currently working on a custom implementation of When to Trust Your Model: Model-Based Policy Optimization. I have a few general questions that apply to Dyna-style algorithms, i.e., algorithms that generate artificial data using a learned world model. Question #1: Should I train the policy on artificial data only, or mix both artificial and real environment samples? Question #2: Given more real data, the world model presumably improves over time. Should I re-generate the artificial data after each update to the world model? Question #3: The original MBPO paper uses an ensemble of world models, i.e., trains multiple instances to predict the forward dynamics. How are the models trained exactly? Should I sample bootstrap samples (with replacement) from the replay buffer of real experience at each step? Question #4: How are the models used to generate rollouts? At the moment, I choose a starting state from the replay buffer of real data and generate the entire model-based rollout with a random model of the ensemble. An alternative would be to randomly choose a model at each timestep, so that each step of the rollout could be generated using another predictive model. Any help would be greatly appreciated! submitted by /u/Internal-Brush4929 [link] [comments]  ( 1 min )
    Postdoc in the space of RL/ML/CompNeuro
    Dear all, I am looking to hire a full-time postdoctoral researcher in the space of (neuro-inspired) reinforcement learning to join my lab at Institute of Cognitive Science of the University of Osnabrück, Germany. The aim of this EU funded position is to explore RL in the domain of modelling visual processing in the primate brain. The ideal candidate will have gained solid experience with deep RL during their PhD, and will have background knowledge or a strong interest in neuro-/cognitive science questions and modelling in computational neuroscience. More information on our work here: https://www.kietzmannlab.org Some more information on the type of position/location here: https://twitter.com/TimKietzmann/status/1482027695856828417 Please contact me for more information if interested. Best Tim submitted by /u/sigh_ence [link] [comments]  ( 1 min )
    State transition problem in reinforcement learning
    I have used Q-learning to train the agent with finite discrete action and state space. Now assume a state s is a tuple of (A,B,C), when the agent takes an action, only A' and B' of the new state s'can be known. In this case, what kind of RL methods I can use to train the agent for discrete and continuous action/state space respectively? ​ I mean only part of the new state can be observed. ​ Thank you very much! submitted by /u/Backpackerxiaofeng [link] [comments]  ( 1 min )
    When should you use RL?
    Short of MCMC models, GLMs, GARCH, ARIMA and VAR, RL is really the only type ML I like. A lot of the other stuff is rarely based on any mathematical foundations and is never really interpretable. However, RL isn't really used a lot at least not outside autonomous cars, games, network, power grids and finance problems. I feel like other things often work better. Nonetheless, I'd still like to know when is it ideal to use RL. My first instinct would tell me that it is whenever you need optimization of some sort, but I doubt that's the only case. I'm a big fan of RL and will probably incorporate it at some rate in my personal projects. Really would like to know where it is appropriate. submitted by /u/HesperusIII [link] [comments]  ( 1 min )
  • Open

    Extract granular sentiment in text with Amazon Comprehend Targeted Sentiment
    Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to discover insights from text. As a fully managed service, Amazon Comprehend requires no ML expertise and can scale to large volumes of data. Amazon Comprehend provides several different APIs to easily integrate NLP into your applications. You can simply call […]  ( 8 min )
    Amazon SageMaker Autopilot now supports time series data
    Amazon SageMaker Autopilot automatically builds, trains, and tunes the best machine learning (ML) models based on your data, while allowing you to maintain full control and visibility. We have recently announced support for time series data in Autopilot. You can use Autopilot to tackle regression and classification tasks on time series data, or sequence data […]  ( 5 min )
    Enable Amazon SageMaker JumpStart for custom IAM execution roles
    With an Amazon SageMaker Domain, you can onboard users with an AWS Identity and Access Management (IAM) execution role different than the Domain execution role. In such case, the onboarded Domain user can’t create projects using templates and Amazon SageMaker JumpStart solutions. This post outlines an automated approach to enable JumpStart for Domain users with […]  ( 6 min )
    Predict residential real estate prices at ImmoScout24 with Amazon SageMaker
    This is a guest post by Oliver Frost, data scientist at ImmoScout24, in partnership with Lukas Müller, AWS Solutions Architect. In 2010, ImmoScout24 released a price index for residential real estate in Germany: the IMX. It was based on ImmoScout24 listings. Besides the price, listings typically contain a lot of specific information such as the […]  ( 12 min )
    Transforming qualitative research by automating speech to text-to-text analytics
    This post is authored by Satish Jha, Intelligent Automation Manager, Matt Docherty, Data Science Manager, Jayesh Muley, Associate Consultant and Tapan Vora, Rapid Prototyping, from ZS Associates. At ZS Associates, we do a significant amount of qualitative market research. The work involves interviewing relevant subjects (such as healthcare professionals and sales representatives) and developing bespoke […]  ( 7 min )
  • Open

    [D] What are some top papers you recommend for beginners in deep learning?
    As the title says, what are some top papers in deep learning you recommend for people who want to jump in the deep learning field. Please only include the papers you have read yourself or plan to read soon. Do you have anything in mind which is also a review of the field? submitted by /u/SciBlend [link] [comments]  ( 1 min )
    [D] Extract from plain paper
    Extract all drawings from plain paper I want to extract all drawings/notes from a plain paper to project only the drawings/notes on a canvas in an app. So a 1:1 copy of the inkt. Should I use opencv for this or are there any extraction models? Thanks submitted by /u/dirk_klement [link] [comments]  ( 1 min )
    [D] Does this experiment make sense? (BERT, comparing text encoding)
    My goal is to compare different text encoding methods. My intent is to be rigorous, and keep everything as apples-to-apples as possible. We start with 10,000 comments from reddit, half from the /r/lotr sub, and half from /r/StarWars. Our goal is to build a simple classifier to predict which is which, and use that to compare how different inputs perform, all else being equal. Example 1: one-hot encoded The Data The Model (google colab) Example 2: word2vec sentence embeddings (dimension-wise average) The Data The Model (google colab) Same model, same data, all we change is the encoding scheme. Make sense so far? Now I want to use BERT embeddings and compare apples-to-apples. I am using bert-based-uncased via HuggingFace: auto_cls = AutoModelForSequenceClassification.from_pretrained…  ( 2 min )
    [D] AutoML vs AI Tables. Is it a new rival for Self-Service Machine Learning?
    Hey, machine learning experts! I represent a community-driven open source project called MindsDB (see on GitHub). We need your feedback about our concept for doing machine learning using SQL! It is called AI Tables and aims to democratize machine learning for all who work with data. There's an article on Medium with SQL commands examples. Please share your thoughts about it. Your true opinions will be a great contribution towards what a team of 25+ people are working on in the last 3 years! Thanks in advance! Costa submitted by /u/C0staTin [link] [comments]  ( 1 min )
    [D] What is current SOTA in Image to Image Translation?
    Hey guys just wondering if someone can point in the right direction with this one. I’m currently using cycleGAN for a specific task and was wondering if there are other better models out there. submitted by /u/DaBeastGeek [link] [comments]  ( 1 min )
    [P] Opytimizer: A Nature-Inspired Python Optimizer
    Opytimizer: A Nature-Inspired Python Optimizer Did you ever reach a bottleneck in your computational experiments? Are you tired of selecting suitable parameters for a chosen technique? If yes, Opytimizer is the real deal! This package provides an easy-to-go implementation of meta-heuristic optimizations. From agents to search space, from internal functions to external communication, we will foster all research related to optimizing stuff. Use Opytimizer if you need a library or wish to: * Create your optimization algorithm; * Design or use pre-loaded optimization tasks; * Mix-and-match different strategies to solve your problem; * Because it is fun to optimize things. Read the docs at opytimizer.readthedocs.io. Opytimizer is compatible with: Python 3.6+. How-To-Use: Minimal Example Take a look at a quick working example of Opytimizer. Note that we are not passing many extra arguments nor additional information to the procedure. For more complex examples, please check our examples/ folder. ```Python import numpy as np from opytimizer import Opytimizer from opytimizer.core import Function from opytimizer.optimizers.swarm import PSO from opytimizer.spaces import SearchSpace def sphere(x): return np.sum(x ** 2) n_agents = 20 n_variables = 2 lower_bound = [-10, -10] upper_bound = [10, 10] space = SearchSpace(n_agents, n_variables, lower_bound, upper_bound) optimizer = PSO() function = Function(sphere) opt = Opytimizer(space, optimizer, function) opt.start(n_iterations=1000) ``` submitted by /u/gthrosa [link] [comments]  ( 1 min )
    [P] NLP Transformers pipelines with ONNX
    Hello everyone, I did a post about using HF Transformers pipelines with ONNX. A convenient way to use Transformers NLP pipelines with ONNX models. Awesome TDS accepted to publish it. Hope you will find it useful and it will helps you accelerate your deployment of NLP applications. I am free to talk about it in the comments below, have a great day =) https://towardsdatascience.com/nlp-transformers-pipelines-with-onnx-9b890d015723 submitted by /u/ThomasChaigneau [link] [comments]
    [N] Hiring: Postdoctoral researcher in RL/ML/CompNeuro
    Dear all, I am looking to hire a full-time postdoctoral researcher in the space of (neuro-inspired) reinforcement learning to join my lab at Institute of Cognitive Science of the University of Osnabrück, Germany. The aim of this EU funded position is to explore RL in the domain of modelling visual processing in the primate brain. The ideal candidate will have gained solid experience with deep RL during their PhD, and will have background knowledge or a strong interest in neuro-/cognitive science questions and modelling in computational neuroscience. More information on our work here: https://www.kietzmannlab.org Some more information on the type of position/location here: https://twitter.com/TimKietzmann/status/1482027695856828417 Please contact me for more information if interested. Best Tim submitted by /u/sigh_ence [link] [comments]  ( 2 min )
    [D] how many machine learning engineers are there in 2022
    The report[https://www.theverge.com/2017/12/5/16737224/global-ai-talent-shortfall-tencent-report] by The Verge says "Tencent says there are only 300,000 AI engineers worldwide in 2017", Personally, I'm curious how many machine learning engineers in 2022, anyone have any idea? submitted by /u/joeytai1997 [link] [comments]  ( 1 min )
    [D][P] Do you use GPU cloud platform?
    Hi guys! We are an GPU cloud platform based on blockchain, we have a plan to support developers about GPU. We want to know what's the most important reason you choose GPU cloud platform if you are a user of GPU cloud platform. Compared to other GPU cloud platform, do your GPU cloud platform have any special function? It would be great to get some insight from people who know this and willing to share the comments on this! We will invite 3 of you to get a free GPU for 72 hours. THX View Poll submitted by /u/May-Feng [link] [comments]  ( 1 min )
    [Project] How complex of a pattern can unsupervised clustering learn?
    As an example: Say you have time series data and youre wanting to detect a pattern within this data. Say the pattern is complex and involves something like, every other number is a prime number, or every other number is a multiple of it's index, with non-patterned indexes containing noise. Could an unsupervised model detect the presence of a pattern like this in data? Say the training data either does or does not contain this pattern and your wish was to categorize it. Is this something too complex for unsupervised learning? I was thinking of using auto-encoders and then clustering, but maybe there's a better regressive model for this? How complex of a pattern is too complex to learn for this approach? submitted by /u/GreenTimbs [link] [comments]  ( 1 min )
    [P] Encoding of high cardinality categorical features in streaming data for anomaly detection - suggestions?
    Hello, /r/MachineLearning! I am working on a project where I am using Robust Random Cut Forest on incoming streaming data. The idea here is that I will have a continous flow of rather high cardinality data coming in. What I need is some suggestions to what kind of encoding approach would be best for some of the dimensions, and wonder if any of you guys have some experience working with machine learning on streaming data. To simplify a problem; each observations includes an identification of a person, and the number of persons is too high to one hot encode. I have also tried some different approaches with frequency encoding, because each person can show up multiple times. The things that causes problems here, are that multiple new unknown persons can come in. The same problem also applies to a couple of other dimensions aswell. If any one has some suggestions to what kind of encoding would be suited for this sort of thing, I would gladly check it out. I feel like I've been searching around for a couple of weeks now without getting anywhere near closer to what I'm trying to achieve. Note: Sorry for bad formatting and English is not my first language. submitted by /u/MatKrakyl [link] [comments]  ( 2 min )
    [P] Extract Local Attention Positions
    I am working on a project, I want to extract the local attention positions present in the feature maps of every level in the FPN. Below is the visualization: ​ https://preview.redd.it/y6t3koh91bm81.jpg?width=768&format=pjpg&auto=webp&s=7f5eee4229a35ec0c173b0417aa2a6eeb2c7c761 I am using MMDetection toolbox for editing. For example, a level in the FPN outputs the feature map that can be visualized below: ​ https://preview.redd.it/ypi2fmtd1bm81.jpg?width=374&format=pjpg&auto=webp&s=f1414b1d9cfc41969fb4866bdf9ecc56a659e761 I want am to know how to locate those weights and how to extract their location with respect to the (H x W x 256) tensor size output of FPN. submitted by /u/shingekichan1996 [link] [comments]  ( 1 min )
    [D] Recommendations on managing data pipelines and structure for object detection for 10 mi+ images
    Hey guys, so where I work we have nearly 10mi+ images annotated with bounding boxes (and the number keeps growing). To store this metadata we have been using a custom Postgres instance and linking the s3 images through the tables. We have also internally developed a Labelling Tool that integrates with our database As we start hitting the level of scale we are planning, we don't want to keep managing the database ourselves but rather leverage 3rd party tools (opensource and paid tools are fine). As an example one of the issues we are facing at the moment is that our images are located in western Europe but we have some labellers in Argentina, the images are taking 2 to 3 seconds to load on their side as we haven't configured region replication on AWS, and honestly, we don't want to keep developing this and end up reinventing the wheel for this context. ​ We also use our Postgres database to store the metrics for the models we have just trained ​ I have been looking for ML tools but I haven't been able to find a good solution or architecture with a focus on Object Detection. What are you guys using out there or what do you guys recommend? ​ Thanks in advance submitted by /u/Fr33akBoy [link] [comments]  ( 1 min )
    [P] ArtiV: Version control system for large files
    Hello r/MachineLearning I would like to share a tool we just release for large files data version. How do you version your data for machine learning? Just tar your data and put it in s3? use a shared folder in NFS? or use git-like solutions, like DVC, Git LFS? Undoubtedly, putting large files in s3 (or similar object store) or NFS is the most common solution. However, when it comes to version control, Git is the defacto solution. But Git is not designed for versioning large files. If we want to put large files on S3, why don’t we just use a tool that does the data versioning just on top of S3? This is the motivation why we develop the ArtiV For more detail, I recommend you to read this blog post Or please check out the git repository infuseai/artiv. submitted by /u/popcornylu [link] [comments]  ( 3 min )
    [D] Following latest trends / reading latest papers worth the time
    For all in non-academic roles that follow latest papers on DS/ML: Care to share you conclusion - was it worth investing your time, and did you manage to use those advancements at work? Also, any caveat emptors? I.e. any cases where you jumped on the new bandwagon as paper suggested and got burned? Thanks for your input. submitted by /u/a90501 [link] [comments]  ( 2 min )
    [P] Training on multiple seperate sequences
    I have a problem in which I need to train a network on multiple sequences that cannot be stacked on top of one another. I'd like to train a network (likely an LSTM) to learn the pattern of responses based on each of these sequences. ​ For example, we might have one sequence with the sequential values Observation Location 1x Location 1y Location 2 (response) 1 3.8 2.5 9.4 2 3.9 2.7 9.7 and another with the values Observation Location 1x Location 1y Location 2 (response) 1 9.4 4.6 16.8 2 9.2 4.1 16.2 Observation 2 from the first table and observation 1 from the second table do not follow each other. I then need to predict on an unseen sequence like Location 1x Location 1y 5.6 8.4 5.6 8.1 which is also not correlated to the first two, except that the first two should give a guideline on how to predict the sequence. Ideally, I'd like to do K-fold cross-validation where a subset of the sequences is used to train and the other subset is used to validate. I've looked into multiple series predictions, but haven't been able to find what I'm looking for. Would someone be able to point me in the right direction? submitted by /u/EthanUnited [link] [comments]  ( 1 min )
  • Open

    Ithaca | Restoring and attributing ancient texts using deep neural networks
    submitted by /u/nickb [link] [comments]
    How does backpropogation work for multi-layer perceptition networks?
    When backpropogation occurs, does the algorithm either take the average of the cost for a given batch to then calculate the gradient descent and then backpropogate, or does it work out the cost for each trial, record the backpropogation changes to the weights and biases, and then average the desired changes to the weights and biases for the network to complete the backpropogation? submitted by /u/EpicJoeR [link] [comments]  ( 1 min )
    Maths for Machine Learning - Linear Algebra was developed to solve systems of linear equations. This video solidifies your basic understanding of linear equations, linear independence and a matrix from a geometric perspective.
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
  • Open

    Optimizing Airline Tail Assignments for Cleaner Skies
    Posted by Emily Masten, Software Engineer, Google Research, Operations Research Team Airlines around the world are exploring several tactics to meet aggressive CO2 commitments set by the International Civil Aviation Organization (ICAO). This effort has been emphasized in Europe, where aviation accounts for 13.9% of the transportation industry’s carbon emissions. The largest push comes from the European Green Deal, which aims to decrease carbon emissions from transportation by 90% by 2051. The Lufthansa Group has gone even further, committing to a 50% reduction in emissions compared to 2019 by the year 2030 and to reach net-zero emissions by 2050. One unexpected approach that airlines can use to lower carbon emissions is through optimizing their tail assignment, i.e., how to assign air…  ( 6 min )
  • Open

    HUMAN-MACHINE INTERFACE
    WHAT IS HUMAN-MACHINE INTERFACE (HMI)?  ( 1 min )
  • Open

    Range trick for larger sample sizes
    The previous post showed that the standard deviation of a sample of size n can be well estimated by multiplying the sample range by a constant dn that depends on n. The method works well for relatively small n. This should sound strange: typically statistical methods work better for large samples, not small samples. And […] Range trick for larger sample sizes first appeared on John D. Cook.  ( 4 min )
  • Open

    Matrix Decomposition Perspective for Accuracy Assessment of Item Response Theory. (arXiv:2203.03112v1 [stat.CO])
    The item response theory obtains the estimates and their confidence intervals for parameters of abilities of examinees and difficulties of problems by using the observed item response matrix consisting of 0/1 value elements. Many papers discuss the performance of the estimates. However, this paper does not. Using the maximum likelihood estimates, we can reconstruct the estimated item response matrix. Then we can assess the accuracy of this reconstructed matrix to the observed response matrix from the matrix decomposition perspective. That is, this paper focuses on the performance of the reconstructed response matrix. To compare the performance of the item response theory with others, we provided the two kinds of low rank response matrix by approximating the observed response matrix; one is the matrix via the singular value decomposition method when the response matrix is a complete matrix, and the other is the matrix via the matrix decomposition method when the response matrix is an incomplete matrix. We have, firstly, found that the performance of the singular value decomposition method and the matrix decomposition method is almost the same when the response matrix is a complete matrix. Here, the performance is measured by the closeness between the two matrices using the root mean squared errors and the accuracy. Secondary, we have seen that the closeness of the reconstructed matrix obtained from the item response theory to the observed matrix is located between the two approximated low rank response matrices obtained from the matrix decomposition method of k= and k=2 to the observed matrix, where k indicates the first k columns use in the decomposed matrices.
    Hierarchical Bayesian Bandits. (arXiv:2111.06929v2 [cs.LG] UPDATED)
    Meta-, multi-task, and federated learning can be all viewed as solving similar tasks, drawn from a distribution that reflects task similarities. We provide a unified view of all these problems, as learning to act in a hierarchical Bayesian bandit. We propose and analyze a natural hierarchical Thompson sampling algorithm (HierTS) for this class of problems. Our regret bounds hold for many variants of the problems, including when the tasks are solved sequentially or in parallel; and show that the regret decreases with a more informative prior. Our proofs rely on a novel total variance decomposition that can be applied beyond our models. Our theory is complemented by experiments, which show that the hierarchy helps with knowledge sharing among the tasks. This confirms that hierarchical Bayesian bandits are a universal and statistically-efficient tool for learning to act with similar bandit tasks.  ( 2 min )
    Direct Random Search for Fine Tuning of Deep Reinforcement Learning Policies. (arXiv:2109.05604v2 [cs.RO] UPDATED)
    Researchers have demonstrated that Deep Reinforcement Learning (DRL) is a powerful tool for finding policies that perform well on complex robotic systems. However, these policies are often unpredictable and can induce highly variable behavior when evaluated with only slightly different initial conditions. Training considerations constrain DRL algorithm designs in that most algorithms must use stochastic policies during training. The resulting policy used during deployment, however, can and frequently is a deterministic one that uses the Maximum Likelihood Action (MLA) at each step. In this work, we show that a direct random search is very effective at fine-tuning DRL policies by directly optimizing them using deterministic rollouts. We illustrate this across a large collection of reinforcement learning environments, using a wide variety of policies obtained from different algorithms. Our results show that this method yields more consistent and higher performing agents on the environments we tested. Furthermore, we demonstrate how this method can be used to extend our previous work on shrinking the dimensionality of the reachable state space of closed-loop systems run under Deep Neural Network (DNN) policies.  ( 2 min )
    Neuron Coverage-Guided Domain Generalization. (arXiv:2103.00229v2 [cs.LG] UPDATED)
    This paper focuses on the domain generalization task where domain knowledge is unavailable, and even worse, only samples from a single domain can be utilized during training. Our motivation originates from the recent progresses in deep neural network (DNN) testing, which has shown that maximizing neuron coverage of DNN can help to explore possible defects of DNN (i.e., misclassification). More specifically, by treating the DNN as a program and each neuron as a functional point of the code, during the network training we aim to improve the generalization capability by maximizing the neuron coverage of DNN with the gradient similarity regularization between the original and augmented samples. As such, the decision behavior of the DNN is optimized, avoiding the arbitrary neurons that are deleterious for the unseen samples, and leading to the trained DNN that can be better generalized to out-of-distribution samples. Extensive studies on various domain generalization tasks based on both single and multiple domain(s) setting demonstrate the effectiveness of our proposed approach compared with state-of-the-art baseline methods. We also analyze our method by conducting visualization based on network dissection. The results further provide useful evidence on the rationality and effectiveness of our approach.  ( 2 min )
    Hierarchically Structured Scheduling and Execution of Tasks in a Multi-Agent Environment. (arXiv:2203.03021v1 [cs.LG])
    In a warehouse environment, tasks appear dynamically. Consequently, a task management system that matches them with the workforce too early (e.g., weeks in advance) is necessarily sub-optimal. Also, the rapidly increasing size of the action space of such a system consists of a significant problem for traditional schedulers. Reinforcement learning, however, is suited to deal with issues requiring making sequential decisions towards a long-term, often remote, goal. In this work, we set ourselves on a problem that presents itself with a hierarchical structure: the task-scheduling, by a centralised agent, in a dynamic warehouse multi-agent environment and the execution of one such schedule, by decentralised agents with only partial observability thereof. We propose to use deep reinforcement learning to solve both the high-level scheduling problem and the low-level multi-agent problem of schedule execution. Finally, we also conceive the case where centralisation is impossible at test time and workers must learn how to cooperate in executing the tasks in an environment with no schedule and only partial observability.  ( 2 min )
    Leashing the Inner Demons: Self-Detoxification for Language Models. (arXiv:2203.03072v1 [cs.CL])
    Language models (LMs) can reproduce (or amplify) toxic language seen during training, which poses a risk to their practical application. In this paper, we conduct extensive experiments to study this phenomenon. We analyze the impact of prompts, decoding strategies and training corpora on the output toxicity. Based on our findings, we propose a simple yet effective method for language models to "detoxify" themselves without an additional large corpus or external discriminator. Compared to a supervised baseline, our proposed method shows better toxicity reduction with good generation quality in the generated content under multiple settings. Warning: some examples shown in the paper may contain uncensored offensive content.  ( 2 min )
    Learning Optimal Conformal Classifiers. (arXiv:2110.09192v2 [cs.LG] UPDATED)
    Modern deep learning based classifiers show very high accuracy on test data but this does not provide sufficient guarantees for safe deployment, especially in high-stake AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's predictions, e.g., its probability estimates, to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. Compared to standard training, ConfTr reduces the average confidence set size (inefficiency) of state-of-the-art CP methods applied after training. Moreover, it allows to "shape" the confidence sets predicted at test time, which is difficult for standard CP. On experiments with several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.  ( 2 min )
    Low Budget Active Learning via Wasserstein Distance: An Integer Programming Approach. (arXiv:2106.02968v3 [cs.LG] UPDATED)
    Active learning is the process of training a model with limited labeled data by selecting a core subset of an unlabeled data pool to label. The large scale of data sets used in deep learning forces most sample selection strategies to employ efficient heuristics. This paper introduces an integer optimization problem for selecting a core set that minimizes the discrete Wasserstein distance from the unlabeled pool. We demonstrate that this problem can be tractably solved with a Generalized Benders Decomposition algorithm. Our strategy uses high-quality latent features that can be obtained by unsupervised learning on the unlabeled pool. Numerical results on several data sets show that our optimization approach is competitive with baselines and particularly outperforms them in the low budget regime where less than one percent of the data set is labeled.  ( 2 min )
    Distributional Hardness Against Preconditioned Lasso via Erasure-Robust Designs. (arXiv:2203.02824v1 [cs.DS])
    Sparse linear regression with ill-conditioned Gaussian random designs is widely believed to exhibit a statistical/computational gap, but there is surprisingly little formal evidence for this belief, even in the form of examples that are hard for restricted classes of algorithms. Recent work has shown that, for certain covariance matrices, the broad class of Preconditioned Lasso programs provably cannot succeed on polylogarithmically sparse signals with a sublinear number of samples. However, this lower bound only shows that for every preconditioner, there exists at least one signal that it fails to recover successfully. This leaves open the possibility that, for example, trying multiple different preconditioners solves every sparse linear regression problem. In this work, we prove a stronger lower bound that overcomes this issue. For an appropriate covariance matrix, we construct a single signal distribution on which any invertibly-preconditioned Lasso program fails with high probability, unless it receives a linear number of samples. Surprisingly, at the heart of our lower bound is a new positive result in compressed sensing. We show that standard sparse random designs are with high probability robust to adversarial measurement erasures, in the sense that if $b$ measurements are erased, then all but $O(b)$ of the coordinates of the signal are still information-theoretically identifiable. To our knowledge, this is the first time that partial recoverability of arbitrary sparse signals under erasures has been studied in compressed sensing.  ( 2 min )
    Automated Identification of Cell Populations in Flow Cytometry Data with Transformers. (arXiv:2108.10072v2 [q-bio.QM] UPDATED)
    Acute Lymphoblastic Leukemia (ALL) is the most frequent hematologic malignancy in children and adolescents. A strong prognostic factor in ALL is given by the Minimal Residual Disease (MRD), which is a measure for the number of leukemic cells persistent in a patient. Manual MRD assessment from Multiparameter Flow Cytometry (FCM) data after treatment is time-consuming and subjective. In this work, we present an automated method to compute the MRD value directly from FCM data. We present a novel neural network approach based on the transformer architecture that learns to directly identify blast cells in a sample. We train our method in a supervised manner and evaluate it on publicly available ALL FCM data from three different clinical centers. Our method reaches a median F1 score of ~0.94 when evaluated on 519 B-ALL samples and shows better results than existing methods on 4 different datasets  ( 2 min )
    Deep Fusion of Lead-lag Graphs: Application to Cryptocurrencies. (arXiv:2201.02040v2 [cs.LG] UPDATED)
    The study of time series has motivated many researchers, particularly on the area of multivariate-analysis. The study of co-movements and dependency between random variables leads us to develop metrics to describe existing connection between assets. The most commonly used are correlation and causality. Despite the growing literature, some connections remained still undetected. The objective of this paper is to propose a new representation learning algorithm capable to integrate synchronous and asynchronous relationships.  ( 2 min )
    Cascaded Gaps: Towards Gap-Dependent Regret for Risk-Sensitive Reinforcement Learning. (arXiv:2203.03110v1 [cs.LG])
    In this paper, we study gap-dependent regret guarantees for risk-sensitive reinforcement learning based on the entropic risk measure. We propose a novel definition of sub-optimality gaps, which we call cascaded gaps, and we discuss their key components that adapt to the underlying structures of the problem. Based on the cascaded gaps, we derive non-asymptotic and logarithmic regret bounds for two model-free algorithms under episodic Markov decision processes. We show that, in appropriate settings, these bounds feature exponential improvement over existing ones that are independent of gaps. We also prove gap-dependent lower bounds, which certify the near optimality of the upper bounds.  ( 2 min )
    Information Compensation for Deep Conditional Generative Networks. (arXiv:2001.08559v3 [cs.LG] UPDATED)
    In recent years, unsupervised/weakly-supervised conditional generative adversarial networks (GANs) have achieved many successes on the task of modeling and generating data. However, one of their weaknesses lies in their poor ability to separate, or disentangle, the different factors that characterize the representation encoded in their latent space. To address this issue, we propose a novel structure for unsupervised conditional GANs powered by a novel Information Compensation Connection (IC-Connection). The proposed IC-Connection enables GANs to compensate for information loss incurred during deconvolution operations. In addition, to quantify the degree of disentanglement on both discrete and continuous latent variables, we design a novel evaluation procedure. Our empirical results suggest that our method achieves better disentanglement compared to the state-of-the-art GANs in a conditional generation setting.  ( 2 min )
    Applications of Multi-Agent Reinforcement Learning in Future Internet: A Comprehensive Survey. (arXiv:2110.13484v2 [cs.AI] UPDATED)
    Future Internet involves several emerging technologies such as 5G and beyond 5G networks, vehicular networks, unmanned aerial vehicle (UAV) networks, and Internet of Things (IoTs). Moreover, future Internet becomes heterogeneous and decentralized with a large number of involved network entities. Each entity may need to make its local decision to improve the network performance under dynamic and uncertain network environments. Standard learning algorithms such as single-agent Reinforcement Learning (RL) or Deep Reinforcement Learning (DRL) have been recently used to enable each network entity as an agent to learn an optimal decision-making policy adaptively through interacting with the unknown environments. However, such an algorithm fails to model the cooperations or competitions among network entities, and simply treats other entities as a part of the environment that may result in the non-stationarity issue. Multi-agent Reinforcement Learning (MARL) allows each network entity to learn its optimal policy by observing not only the environments, but also other entities' policies. As a result, MARL can significantly improve the learning efficiency of the network entities, and it has been recently used to solve various issues in the emerging networks. In this paper, we thus review the applications of MARL in the emerging networks. In particular, we provide a tutorial of MARL and a comprehensive survey of applications of MARL in next generation Internet. In particular, we first introduce single-agent RL and MARL. Then, we review a number of applications of MARL to solve emerging issues in future Internet. The issues consist of network access, transmit power control, computation offloading, content caching, packet routing, trajectory design for UAV-aided networks, and network security issues.  ( 3 min )
    Diffusion Maps : Using the Semigroup Property for Parameter Tuning. (arXiv:2203.02867v1 [stat.ML])
    Diffusion maps (DM) constitute a classic dimension reduction technique, for data lying on or close to a (relatively) low-dimensional manifold embedded in a much larger dimensional space. The DM procedure consists in constructing a spectral parametrization for the manifold from simulated random walks or diffusion paths on the data set. However, DM is hard to tune in practice. In particular, the task to set a diffusion time t when constructing the diffusion kernel matrix is critical. We address this problem by using the semigroup property of the diffusion operator. We propose a semigroup criterion for picking t. Experiments show that this principled approach is effective and robust.  ( 2 min )
    You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks. (arXiv:2106.13264v2 [cs.LG] UPDATED)
    Hypergraphs are used to model higher-order interactions amongst agents and there exist many practically relevant instances of hypergraph datasets. To enable efficient processing of hypergraph-structured data, several hypergraph neural network platforms have been proposed for learning hypergraph properties and structure, with a special focus on node classification. However, almost all existing methods use heuristic propagation rules and offer suboptimal performance on many datasets. We propose AllSet, a new hypergraph neural network paradigm that represents a highly general framework for (hyper)graph neural networks and for the first time implements hypergraph neural network layers as compositions of two multiset functions that can be efficiently learned for each task and each dataset. Furthermore, AllSet draws on new connections between hypergraph neural networks and recent advances in deep learning of multiset functions. In particular, the proposed architecture utilizes Deep Sets and Set Transformer architectures that allow for significant modeling flexibility and offer high expressive power. To evaluate the performance of AllSet, we conduct the most extensive experiments to date involving ten known benchmarking datasets and three newly curated datasets that represent significant challenges for hypergraph node classification. The results demonstrate that AllSet has the unique ability to consistently either match or outperform all other hypergraph neural networks across the tested datasets.  ( 2 min )
    Combining Individual and Joint Networking Behavior for Intelligent IoT Analytics. (arXiv:2203.03109v1 [cs.LG])
    The IoT vision of a trillion connected devices over the next decade requires reliable end-to-end connectivity and automated device management platforms. While we have seen successful efforts for maintaining small IoT testbeds, there are multiple challenges for the efficient management of large-scale device deployments. With Industrial IoT, incorporating millions of devices, traditional management methods do not scale well. In this work, we address these challenges by designing a set of novel machine learning techniques, which form a foundation of a new tool, it IoTelligent, for IoT device management, using traffic characteristics obtained at the network level. The design of our tool is driven by the analysis of 1-year long networking data, collected from 350 companies with IoT deployments. The exploratory analysis of this data reveals that IoT environments follow the famous Pareto principle, such as: (i) 10% of the companies in the dataset contribute to 90% of the entire traffic; (ii) 7% of all the companies in the set own 90% of all the devices. We designed and evaluated CNN, LSTM, and Convolutional LSTM models for demand forecasting, with a conclusion of the Convolutional LSTM model being the best. However, maintaining and updating individual company models is expensive. In this work, we design a novel, scalable approach, where a general demand forecasting model is built using the combined data of all the companies with a normalization factor. Moreover, we introduce a novel technique for device management, based on autoencoders. They automatically extract relevant device features to identify device groups with similar behavior to flag anomalous devices.  ( 2 min )
    A Unified View of SDP-based Neural Network Verification through Completely Positive Programming. (arXiv:2203.03034v1 [math.OC])
    Verifying that input-output relationships of a neural network conform to prescribed operational specifications is a key enabler towards deploying these networks in safety-critical applications. Semidefinite programming (SDP)-based approaches to Rectified Linear Unit (ReLU) network verification transcribe this problem into an optimization problem, where the accuracy of any such formulation reflects the level of fidelity in how the neural network computation is represented, as well as the relaxations of intractable constraints. While the literature contains much progress on improving the tightness of SDP formulations while maintaining tractability, comparatively little work has been devoted to the other extreme, i.e., how to most accurately capture the original verification problem before SDP relaxation. In this work, we develop an exact, convex formulation of verification as a completely positive program (CPP), and provide analysis showing that our formulation is minimal -- the removal of any constraint fundamentally misrepresents the neural network computation. We leverage our formulation to provide a unifying view of existing approaches, and give insight into the source of large relaxation gaps observed in some cases.  ( 2 min )
    You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection. (arXiv:2109.00962v2 [eess.AS] UPDATED)
    Audio segmentation and sound event detection are crucial topics in machine listening that aim to detect acoustic classes and their respective boundaries. It is useful for audio-content analysis, speech recognition, audio-indexing, and music information retrieval. In recent years, most research articles adopt segmentation-by-classification. This technique divides audio into small frames and individually performs classification on these frames. In this paper, we present a novel approach called You Only Hear Once (YOHO), which is inspired by the YOLO algorithm popularly adopted in Computer Vision. We convert the detection of acoustic boundaries into a regression problem instead of frame-based classification. This is done by having separate output neurons to detect the presence of an audio class and predict its start and end points. YOHO obtained a higher F-measure and lower error rate than the state-of-the-art Convolutional Recurrent Neural Network on multiple datasets. As YOHO is purely a convolutional neural network and has no recurrent layers, it is faster during inference. In addition, as this approach is more end-to-end and predicts acoustic boundaries directly, it is significantly quicker during post-processing and smoothing.  ( 2 min )
    GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction. (arXiv:2203.03079v1 [cs.CV])
    Attaching attributes (such as color, shape, state, action) to object categories is an important computer vision problem. Attribute prediction has seen exciting recent progress and is often formulated as a multi-label classification problem. Yet significant challenges remain in: 1) predicting diverse attributes over multiple categories, 2) modeling attributes-category dependency, 3) capturing both global and local scene context, and 4) predicting attributes of objects with low pixel-count. To address these issues, we propose a novel multi-category attribute prediction deep architecture named GlideNet, which contains three distinct feature extractors. A global feature extractor recognizes what objects are present in a scene, whereas a local one focuses on the area surrounding the object of interest. Meanwhile, an intrinsic feature extractor uses an extension of standard convolution dubbed Informed Convolution to retrieve features of objects with low pixel-count. GlideNet uses gating mechanisms with binary masks and its self-learned category embedding to combine the dense embeddings. Collectively, the Global-Local-Intrinsic blocks comprehend the scene's global context while attending to the characteristics of the local object of interest. Finally, using the combined features, an interpreter predicts the attributes, and the length of the output is determined by the category, thereby removing unnecessary attributes. GlideNet can achieve compelling results on two recent and challenging datasets -- VAW and CAR -- for large-scale attribute prediction. For instance, it obtains more than 5\% gain over state of the art in the mean recall (mR) metric. GlideNet's advantages are especially apparent when predicting attributes of objects with low pixel counts as well as attributes that demand global context understanding. Finally, we show that GlideNet excels in training starved real-world scenarios.  ( 2 min )
    Scalable Uncertainty Quantification for Deep Operator Networks using Randomized Priors. (arXiv:2203.03048v1 [cs.LG])
    We present a simple and effective approach for posterior uncertainty quantification in deep operator networks (DeepONets); an emerging paradigm for supervised learning in function spaces. We adopt a frequentist approach based on randomized prior ensembles, and put forth an efficient vectorized implementation for fast parallel inference on accelerated hardware. Through a collection of representative examples in computational mechanics and climate modeling, we show that the merits of the proposed approach are fourfold. (1) It can provide more robust and accurate predictions when compared against deterministic DeepONets. (2) It shows great capability in providing reliable uncertainty estimates on scarce data-sets with multi-scale function pairs. (3) It can effectively detect out-of-distribution and adversarial examples. (4) It can seamlessly quantify uncertainty due to model bias, as well as noise corruption in the data. Finally, we provide an optimized JAX library called {\em UQDeepONet} that can accommodate large model architectures, large ensemble sizes, as well as large data-sets with excellent parallel performance on accelerated hardware, thereby enabling uncertainty quantification for DeepONets in realistic large-scale applications.  ( 2 min )
    Social-Implicit: Rethinking Trajectory Prediction Evaluation and The Effectiveness of Implicit Maximum Likelihood Estimation. (arXiv:2203.03057v1 [cs.CV])
    Best-of-N (BoN) Average Displacement Error (ADE)/ Final Displacement Error (FDE) is the most used metric for evaluating trajectory prediction models. Yet, the BoN does not quantify the whole generated samples, resulting in an incomplete view of the model's prediction quality and performance. We propose a new metric, Average Mahalanobis Distance (AMD) to tackle this issue. AMD is a metric that quantifies how close the whole generated samples are to the ground truth. We also introduce the Average Maximum Eigenvalue (AMV) metric that quantifies the overall spread of the predictions. Our metrics are validated empirically by showing that the ADE/FDE is not sensitive to distribution shifts, giving a biased sense of accuracy, unlike the AMD/AMV metrics. We introduce the usage of Implicit Maximum Likelihood Estimation (IMLE) as a replacement for traditional generative models to train our model, Social-Implicit. IMLE training mechanism aligns with AMD/AMV objective of predicting trajectories that are close to the ground truth with a tight spread. Social-Implicit is a memory efficient deep model with only 5.8K parameters that runs in real time of about 580Hz and achieves competitive results. Interactive demo of the problem can be seen here \url{https://www.abduallahmohamed.com/social-implicit-amdamv-adefde-demo}. Code is available at \url{https://github.com/abduallahmohamed/Social-Implicit}.  ( 2 min )
    Deep Dynamic Boosted Forest. (arXiv:1804.07270v4 [cs.LG] UPDATED)
    Random forest is widely exploited as an ensemble learning method. In many practical applications, however, there is still a significant challenge to learn from imbalanced data. To alleviate this limitation, we propose a deep dynamic boosted forest (DDBF), a novel ensemble algorithm that incorporates the notion of hard example mining into random forest. Specically, we propose to measure the quality of each leaf node of every decision tree in the random forest to determine hard examples. By iteratively training and then removing easy examples from training data, we evolve the random forest to focus on hard examples dynamically so as to balance the proportion of samples and learn decision boundaries better. Data can be cascaded through these random forests learned in each iteration in sequence to generate more accurate predictions. Our DDBF outperforms random forest on 5 UCI datasets, MNIST and SATIMAGE, and achieved state-of-the-art results compared to other deep models. Moreover, we show that DDBF is also a new way of sampling and can be very useful and efficient when learning from imbalanced data.  ( 2 min )
    Smoothing with the Best Rectangle Window is Optimal for All Tapered Rectangle Windows. (arXiv:2203.02997v1 [stat.ML])
    We investigate the optimal selection of weight windows for the problem of weighted least squares. We show that weight windows should be symmetric around its center, which is also its peak. We consider the class of tapered rectangle window weights, which are nonincreasing away from the center. We show that the best rectangle window is optimal for such window definitions. We also extend our results to the least absolutes and more general case of arbitrary loss functions to find similar results.  ( 2 min )
    Towards Open-World Recommendation: An Inductive Model-based Collaborative Filtering Approach. (arXiv:2007.04833v3 [cs.IR] UPDATED)
    Recommendation models can effectively estimate underlying user interests and predict one's future behaviors by factorizing an observed user-item rating matrix into products of two sets of latent factors. However, the user-specific embedding factors can only be learned in a transductive way, making it difficult to handle new users on-the-fly. In this paper, we propose an inductive collaborative filtering framework that contains two representation models. The first model follows conventional matrix factorization which factorizes a group of key users' rating matrix to obtain meta latents. The second model resorts to attention-based structure learning that estimates hidden relations from query to key users and learns to leverage meta latents to inductively compute embeddings for query users via neural message passing. Our model enables inductive representation learning for users and meanwhile guarantees equivalent representation capacity as matrix factorization. Experiments demonstrate that our model achieves promising results for recommendation on few-shot users with limited training ratings and new unseen users which are commonly encountered in open-world recommender systems.  ( 2 min )
    SurvSet: An open-source time-to-event dataset repository. (arXiv:2203.03094v1 [stat.ML])
    Time-to-event (T2E) analysis is a branch of statistics that models the duration of time it takes for an event to occur. Such events can include outcomes like death, unemployment, or product failure. Most modern machine learning (ML) algorithms, like decision trees and kernel methods, are supported for T2E modelling with data science software (python and R). To complement these developments, SurvSet is the first open-source T2E dataset repository designed for a rapid benchmarking of ML algorithms and statistical methods. The data in SurvSet have been consistently formatted so that a single preprocessing method will work for all datasets. SurvSet currently has 76 datasets which vary in dimensionality, time dependency, and background (the majority of which come from biomedicine). SurvSet is available on PyPI and can be installed with pip install SurvSet. R users can download the data directly from the corresponding git repository.  ( 2 min )
    EDropout: Energy-Based Dropout and Pruning of Deep Neural Networks. (arXiv:2006.04270v5 [cs.LG] UPDATED)
    Dropout is a well-known regularization method by sampling a sub-network from a larger deep neural network and training different sub-networks on different subsets of the data. Inspired by the dropout concept, we propose EDropout as an energy-based framework for pruning neural networks in classification tasks. In this approach, a set of binary pruning state vectors (population) represents a set of corresponding sub-networks from an arbitrary provided original neural network. An energy loss function assigns a scalar energy loss value to each pruning state. The energy-based model stochastically evolves the population to find states with lower energy loss. The best pruning state is then selected and applied to the original network. Similar to dropout, the kept weights are updated using backpropagation in a probabilistic model. The energy-based model again searches for better pruning states and the cycle continuous. Indeed, this procedure is in fact switching between the energy model, which manages the pruning states, and the probabilistic model, which updates the temporarily unpruned weights, in each iteration. The population can dynamically converge to a pruning state. This can be interpreted as dropout leading to pruning the network. From an implementation perspective, EDropout can prune typical neural networks without modification of the network architecture. We evaluated the proposed method on different flavours of ResNets, AlexNet, and SqueezeNet on the Kuzushiji, Fashion, CIFAR-10, CIFAR-100, and Flowers datasets, and compared the pruning rate and classification performance of the models. On average the networks trained with EDropout achieved a pruning rate of more than $50\%$ of the trainable parameters with approximately $<5\%$ and $<1\%$ drop of Top-1 and Top-5 classification accuracy, respectively.  ( 3 min )
    An Improved Genetic Algorithm and Its Application in Neural Network Adversarial Attack. (arXiv:2110.01818v5 [cs.NE] UPDATED)
    The choice of crossover and mutation strategies plays a crucial role in the searchability, convergence efficiency and precision of genetic algorithms. In this paper, a novel improved genetic algorithm is proposed by improving the crossover and mutation operation of the simple genetic algorithm, and it is verified by 15 test functions. The qualitative results show that, compared with three other mainstream swarm intelligence optimization algorithms, the algorithm can not only improve the global search ability, convergence efficiency and precision, but also increase the success rate of convergence to the optimal value under the same experimental conditions. The quantitative results show that the algorithm performs superiorly in 13 of the 15 tested functions. The Wilcoxon rank-sum test was used for statistical evaluation, showing the significant advantage of the algorithm at $95\%$ confidence intervals. Finally, the algorithm is applied to neural network adversarial attacks. The applied results show that the method does not need the structure and parameter information inside the neural network model, and it can obtain the adversarial samples with high confidence in a brief time just by the classification and confidence information output from the neural network.
    Active Inference and Epistemic Value in Graphical Models. (arXiv:2109.00541v2 [stat.ML] UPDATED)
    The Free Energy Principle (FEP) postulates that biological agents perceive and interact with their environment in order to minimize a Variational Free Energy (VFE) with respect to a generative model of their environment. The inference of a policy (future control sequence) according to the FEP is known as Active Inference (AIF). The AIF literature describes multiple VFE objectives for policy planning that lead to epistemic (information-seeking) behavior. However, most objectives have limited modeling flexibility. This paper approaches epistemic behavior from a constrained Bethe Free Energy (CBFE) perspective. Crucially, variational optimization of the CBFE can be expressed in terms of message passing on free-form generative models. The key intuition behind the CBFE is that we impose a point-mass constraint on predicted outcomes, which explicitly encodes the assumption that the agent will make observations in the future. We interpret the CBFE objective in terms of its constituent behavioral drives. We then illustrate resulting behavior of the CBFE by planning and interacting with a simulated T-maze environment. Simulations for the T-maze task illustrate how the CBFE agent exhibits an epistemic drive, and actively plans ahead to account for the impact of predicted outcomes. Compared to an EFE agent, the CBFE agent incurs expected reward in significantly more environmental scenarios. We conclude that CBFE optimization by message passing suggests a general mechanism for epistemic-aware AIF in free-form generative models.
    Anomaly Detection by Recombining Gated Unsupervised Experts. (arXiv:2008.13763v4 [cs.LG] UPDATED)
    Anomaly detection has been considered under several extents of prior knowledge. Unsupervised methods do not require any labelled data, whereas semi-supervised methods leverage some known anomalies. Inspired by mixture-of-experts models and the analysis of the hidden activations of neural networks, we introduce a novel data-driven anomaly detection method called ARGUE. Our method is not only applicable to unsupervised and semi-supervised environments, but also profits from prior knowledge of self-supervised settings. We designed ARGUE as a combination of dedicated expert networks, which specialise on parts of the input data. For its final decision, ARGUE fuses the distributed knowledge across the expert systems using a gated mixture-of-experts architecture. Our evaluation motivates that prior knowledge about the normal data distribution may be as valuable as known anomalies.
    Maillard Sampling: Boltzmann Exploration Done Optimally. (arXiv:2111.03290v2 [stat.ML] UPDATED)
    The PhD thesis of Maillard (2013) presents a rather obscure algorithm for the $K$-armed bandit problem. This less-known algorithm, which we call Maillard sampling (MS), computes the probability of choosing each arm in a \textit{closed form}, which is not true for Thompson sampling, a widely-adopted bandit algorithm in the industry. This means that the bandit-logged data from running MS can be readily used for counterfactual evaluation, unlike Thompson sampling. Motivated by such merit, we revisit MS and perform an improved analysis to show that it achieves both the asymptotical optimality and $\sqrt{KT\log{T}}$ minimax regret bound where $T$ is the time horizon, which matches the known bounds for asymptotically optimal UCB. %'s performance. We then propose a variant of MS called MS$^+$ that improves its minimax bound to $\sqrt{KT\log{K}}$. MS$^+$ can also be tuned to be aggressive (i.e., less exploration) without losing the asymptotic optimality, a unique feature unavailable from existing bandit algorithms. Our numerical evaluation shows the effectiveness of MS$^+$.
    Flower: A Friendly Federated Learning Research Framework. (arXiv:2007.14390v5 [cs.LG] UPDATED)
    Federated Learning (FL) has emerged as a promising technique for edge devices to collaboratively learn a shared prediction model, while keeping their training data on the device, thereby decoupling the ability to do machine learning from the need to store the data in the cloud. However, FL is difficult to implement realistically, both in terms of scale and systems heterogeneity. Although there are a number of research frameworks available to simulate FL algorithms, they do not support the study of scalable FL workloads on heterogeneous edge devices. In this paper, we present Flower -- a comprehensive FL framework that distinguishes itself from existing platforms by offering new facilities to execute large-scale FL experiments and consider richly heterogeneous FL device scenarios. Our experiments show Flower can perform FL experiments up to 15M in client size using only a pair of high-end GPUs. Researchers can then seamlessly migrate experiments to real devices to examine other parts of the design space. We believe Flower provides the community with a critical new tool for FL study and development.
    Privacy Amplification by Decentralization. (arXiv:2012.05326v4 [cs.LG] UPDATED)
    Analyzing data owned by several parties while achieving a good trade-off between utility and privacy is a key challenge in federated learning and analytics. In this work, we introduce a novel relaxation of local differential privacy (LDP) that naturally arises in fully decentralized algorithms, i.e., when participants exchange information by communicating along the edges of a network graph without central coordinator. This relaxation, that we call network DP, captures the fact that users have only a local view of the system. To show the relevance of network DP, we study a decentralized model of computation where a token performs a walk on the network graph and is updated sequentially by the party who receives it. For tasks such as real summation, histogram computation and optimization with gradient descent, we propose simple algorithms on ring and complete topologies. We prove that the privacy-utility trade-offs of our algorithms under network DP significantly improve upon what is achievable under LDP, and often match the utility of the trusted curator model. Our results show for the first time that formal privacy gains can be obtained from full decentralization. We also provide experiments to illustrate the improved utility of our approach for decentralized training with stochastic gradient descent.
    L-Verse: Bidirectional Generation Between Image and Text. (arXiv:2111.11133v6 [cs.CV] UPDATED)
    Far beyond learning long-range interactions of natural language, transformers are becoming the de-facto standard for many vision tasks with their power and scalabilty. Especially with cross-modal tasks between image and text, vector quantized variational autoencoders (VQ-VAEs) are widely used to make a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of feature-augmented variational autoencoder (AugVAE) and bidirectional auto-regressive transformer (BiART) for text-to-image and image-to-text generation. Our AugVAE shows the state-of-the-art reconstruction performance on ImageNet1K validation set, along with the robustness to unseen images in the wild. Unlike other models, BiART can distinguish between image (or text) as a conditional reference and a generation target. L-Verse can be directly used for image-to-text or text-to-image generation tasks without any finetuning or extra object detection frameworks. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We furthermore assess the scalability of L-Verse architecture on Conceptual Captions and present the initial results of bidirectional vision-language representation learning on general domain.
    Scene Transformer: A unified architecture for predicting multiple agent trajectories. (arXiv:2106.08417v3 [cs.CV] UPDATED)
    Predicting the motion of multiple agents is necessary for planning in dynamic environments. This task is challenging for autonomous driving since agents (e.g. vehicles and pedestrians) and their associated behaviors may be diverse and influence one another. Most prior work have focused on predicting independent futures for each agent based on all past motion, and planning against these independent predictions. However, planning against independent predictions can make it challenging to represent the future interaction possibilities between different agents, leading to sub-optimal planning. In this work, we formulate a model for predicting the behavior of all agents jointly, producing consistent futures that account for interactions between agents. Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as potentially conditioned on the goal or full future trajectory of the autonomous vehicle or the behavior of other agents in the environment. Our model architecture employs attention to combine features across road elements, agent interactions, and time steps. We evaluate our approach on autonomous driving datasets for both marginal and joint motion prediction, and achieve state of the art performance across two popular datasets. Through combining a scene-centric approach, agent permutation equivariant model, and a sequence masking strategy, we show that our model can unify a variety of motion prediction tasks from joint motion predictions to conditioned prediction.
    Provable RL with Exogenous Distractors via Multistep Inverse Dynamics. (arXiv:2110.08847v2 [cs.LG] UPDATED)
    Many real-world applications of reinforcement learning (RL) require the agent to deal with high-dimensional observations such as those generated from a megapixel camera. Prior work has addressed such problems with representation learning, through which the agent can provably extract endogenous, latent state information from raw observations and subsequently plan efficiently. However, such approaches can fail in the presence of temporally correlated noise in the observations, a phenomenon that is common in practice. We initiate the formal study of latent state discovery in the presence of such exogenous noise sources by proposing a new model, the Exogenous Block MDP (EX-BMDP), for rich observation RL. We start by establishing several negative results, by highlighting failure cases of prior representation learning based approaches. Then, we introduce the Predictive Path Elimination (PPE) algorithm, that learns a generalization of inverse dynamics and is provably sample and computationally efficient in EX-BMDPs when the endogenous state dynamics are near deterministic. The sample complexity of PPE depends polynomially on the size of the latent endogenous state space while not directly depending on the size of the observation space, nor the exogenous state space. We provide experiments on challenging exploration problems which show that our approach works empirically.
    On the Fairness of Causal Algorithmic Recourse. (arXiv:2010.06529v5 [cs.LG] UPDATED)
    Algorithmic fairness is typically studied from the perspective of predictions. Instead, here we investigate fairness from the perspective of recourse actions suggested to individuals to remedy an unfavourable classification. We propose two new fairness criteria at the group and individual level, which -- unlike prior work on equalising the average group-wise distance from the decision boundary -- explicitly account for causal relationships between features, thereby capturing downstream effects of recourse actions performed in the physical world. We explore how our criteria relate to others, such as counterfactual fairness, and show that fairness of recourse is complementary to fairness of prediction. We study theoretically and empirically how to enforce fair causal recourse by altering the classifier and perform a case study on the Adult dataset. Finally, we discuss whether fairness violations in the data generating process revealed by our criteria may be better addressed by societal interventions as opposed to constraints on the classifier.
    Minimax Kernel Machine Learning for a Class of Doubly Robust Functionals with Application to Proximal Causal Inference. (arXiv:2104.02929v3 [stat.ML] UPDATED)
    Robins et al. (2008) introduced a class of influence functions (IFs) which could be used to obtain doubly robust moment functions for the corresponding parameters. However, that class does not include the IF of parameters for which the nuisance functions are solutions to integral equations. Such parameters are particularly important in the field of causal inference, specifically in the recently proposed proximal causal inference framework of Tchetgen Tchetgen et al. (2020), which allows for estimating the causal effect in the presence of latent confounders. In this paper, we first extend the class of Robins et al. to include doubly robust IFs in which the nuisance functions are solutions to integral equations. Then we demonstrate that the double robustness property of these IFs can be leveraged to construct estimating equations for the nuisance functions, which enables us to solve the integral equations without resorting to parametric models. We frame the estimation of the nuisance functions as a minimax optimization problem. We provide convergence rates for the nuisance functions and conditions required for asymptotic linearity of the estimator of the parameter of interest. The experiment results demonstrate that our proposed methodology leads to robust and high-performance estimators for average causal effect in the proximal causal inference framework.
    Polynomial Time Reinforcement Learning in Factored State MDPs with Linear Value Functions. (arXiv:2107.05187v2 [cs.LG] UPDATED)
    Many reinforcement learning (RL) environments in practice feature enormous state spaces that may be described compactly by a "factored" structure, that may be modeled by Factored Markov Decision Processes (FMDPs). We present the first polynomial-time algorithm for RL in Factored State MDPs (generalizing FMDPs) that neither relies on an oracle planner nor requires a linear transition model; it only requires a linear value function with a suitable local basis with respect to the factorization, permitting efficient variable elimination. With this assumption, we can solve this family of Factored State MDPs in polynomial time by constructing an efficient separation oracle for convex optimization. Importantly, and in contrast to prior work on FMDPs, we do not assume that the transitions on various factors are conditionally independent.
    xFAIR: Better Fairness via Model-based Rebalancing of Protected Attributes. (arXiv:2110.01109v2 [cs.LG] UPDATED)
    Context: Machine learning software can generate models that inappropriately discriminate against specific protected social groups (e.g., groups based on gender, ethnicity, etc). Motivated by those results, software engineering researchers have proposed many methods for mitigating those discriminatory effects. While those methods are effective in mitigating bias, few of them can provide explanations on what is the root cause of bias. Objective: We aim at better detection and mitigation of algorithmic discrimination in machine learning software problems. Method: Here we propose xFAIR, a model-based extrapolation method, that is capable of both mitigating bias and explaining the cause. In our xFAIR approach, protected attributes are represented by models learned from the other independent variables (and these models offer extrapolations over the space between existing examples). We then use the extrapolation models to relabel protected attributes later seen in testing data or deployment time. Our approach aims to offset the biased predictions of the classification model via rebalancing the distribution of protected attributes. Results: The experiments of this paper show that, without compromising (original) model performance, xFAIR can achieve significantly better group and individual fairness (as measured in different metrics) than benchmark methods. Moreover, when compared to another instance-based rebalancing method, our model-based approach shows faster runtime and thus better scalability. Conclusion: Algorithmic decision bias can be removed via extrapolation that smooths away outlier points. As evidence for this, our proposed xFAIR is not only performance-wise better (measured by fairness and performance metrics) than two state-of-the-art fairness algorithms.
    A Johnson--Lindenstrauss Framework for Randomly Initialized CNNs. (arXiv:2111.02155v2 [cs.LG] UPDATED)
    How does the geometric representation of a dataset change after the application of each randomly initialized layer of a neural network? The celebrated Johnson--Lindenstrauss lemma answers this question for linear fully-connected neural networks (FNNs), stating that the geometry is essentially preserved. For FNNs with the ReLU activation, the angle between two inputs contracts according to a known mapping. The question for non-linear convolutional neural networks (CNNs) becomes much more intricate. To answer this question, we introduce a geometric framework. For linear CNNs, we show that the Johnson--Lindenstrauss lemma continues to hold, namely, that the angle between two inputs is preserved. For CNNs with ReLU activation, on the other hand, the behavior is richer: The angle between the outputs contracts, where the level of contraction depends on the nature of the inputs. In particular, after one layer, the geometry of natural images is essentially preserved, whereas for Gaussian correlated inputs, CNNs exhibit the same contracting behavior as FNNs with ReLU activation.
    Learning Neural Causal Models with Active Interventions. (arXiv:2109.02429v2 [stat.ML] UPDATED)
    Discovering causal structures from data is a challenging inference problem of fundamental importance in all areas of science. The appealing properties of neural networks have recently led to a surge of interest in differentiable neural network-based methods for learning causal structures from data. So far, differentiable causal discovery has focused on static datasets of observational or fixed interventional origin. In this work, we introduce an active intervention targeting (AIT) method which enables a quick identification of the underlying causal structure of the data-generating process. Our method significantly reduces the required number of interactions compared with random intervention targeting and is applicable for both discrete and continuous optimization formulations of learning the underlying directed acyclic graph (DAG) from data. We examine the proposed method across multiple frameworks in a wide range of settings and demonstrate superior performance on multiple benchmarks from simulated to real-world data.
    Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime. (arXiv:2203.03159v1 [cs.LG])
    Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most of existing generalization analyses are made for single-pass SGD, which is a less practical variant compared to the commonly-used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which may be pessimistic to explain the superior generalization ability for some particular problem instance. The goal of this paper is to sharply characterize the generalization of multi-pass SGD, by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that SGD always performs worse, instance-wisely, than GD, in generalization. On the other hand, we show that although SGD needs more iterations than GD to achieve the same level of excess risk, it saves the number of stochastic gradient evaluations, and therefore is preferable in terms of computational time.
    Algorithmic Regularization in Model-free Overparametrized Asymmetric Matrix Factorization. (arXiv:2203.02839v1 [cs.LG])
    We study the asymmetric matrix factorization problem under a natural nonconvex formulation with arbitrary overparamatrization. We consider the model-free setting with no further assumption on the rank or singular values of the observed matrix, where the global optima provably overfit. We show that vanilla gradient descent with small random initialization and early stopping produces the best low-rank approximation of the observed matrix, without any additional regularization. We provide a sharp analysis on relationship between the iteration complexity, initialization size, stepsize and final error. In particular, our complexity bound is almost dimension-free and depends logarithmically on the final error, and our results have lenient requirements on the stepsize and initialization. Our bounds improve upon existing work and show good agreement with numerical experiments.
    On Steering Multi-Annotations per Sample for Multi-Task Learning. (arXiv:2203.02946v1 [cs.CV])
    The study of multi-task learning has drawn great attention from the community. Despite the remarkable progress, the challenge of optimally learning different tasks simultaneously remains to be explored. Previous works attempt to modify the gradients from different tasks. Yet these methods give a subjective assumption of the relationship between tasks, and the modified gradient may be less accurate. In this paper, we introduce Stochastic Task Allocation~(STA), a mechanism that addresses this issue by a task allocation approach, in which each sample is randomly allocated a subset of tasks. For further progress, we propose Interleaved Stochastic Task Allocation~(ISTA) to iteratively allocate all tasks to each example during several consecutive iterations. We evaluate STA and ISTA on various datasets and applications: NYUv2, Cityscapes, and COCO for scene understanding and instance segmentation. Our experiments show both STA and ISTA outperform current state-of-the-art methods. The code will be available.
    Coresets for Data Discretization and Sine Wave Fitting. (arXiv:2203.03009v1 [cs.LG])
    In the \emph{monitoring} problem, the input is an unbounded stream $P={p_1,p_2\cdots}$ of integers in $[N]:=\{1,\cdots,N\}$, that are obtained from a sensor (such as GPS or heart beats of a human). The goal (e.g., for anomaly detection) is to approximate the $n$ points received so far in $P$ by a single frequency $\sin$, e.g. $\min_{c\in C}cost(P,c)+\lambda(c)$, where $cost(P,c)=\sum_{i=1}^n \sin^2(\frac{2\pi}{N} p_ic)$, $C\subseteq [N]$ is a feasible set of solutions, and $\lambda$ is a given regularization function. For any approximation error $\varepsilon>0$, we prove that \emph{every} set $P$ of $n$ integers has a weighted subset $S\subseteq P$ (sometimes called core-set) of cardinality $|S|\in O(\log(N)^{O(1)})$ that approximates $cost(P,c)$ (for every $c\in [N]$) up to a multiplicative factor of $1\pm\varepsilon$. Using known coreset techniques, this implies streaming algorithms using only $O((\log(N)\log(n))^{O(1)})$ memory. Our results hold for a large family of functions. Experimental results and open source code are provided.
    Is Bayesian Model-Agnostic Meta Learning Better than Model-Agnostic Meta Learning, Provably?. (arXiv:2203.03059v1 [cs.LG])
    Meta learning aims at learning a model that can quickly adapt to unseen tasks. Widely used meta learning methods include model agnostic meta learning (MAML), implicit MAML, Bayesian MAML. Thanks to its ability of modeling uncertainty, Bayesian MAML often has advantageous empirical performance. However, the theoretical understanding of Bayesian MAML is still limited, especially on questions such as if and when Bayesian MAML has provably better performance than MAML. In this paper, we aim to provide theoretical justifications for Bayesian MAML's advantageous performance by comparing the meta test risks of MAML and Bayesian MAML. In the meta linear regression, under both the distribution agnostic and linear centroid cases, we have established that Bayesian MAML indeed has provably lower meta test risks than MAML. We verify our theoretical results through experiments.
    Kernel Packet: An Exact and Scalable Algorithm for Gaussian Process Regression with Mat\'ern Correlations. (arXiv:2203.03116v1 [stat.ML])
    We develop an exact and scalable algorithm for one-dimensional Gaussian process regression with Mat\'ern correlations whose smoothness parameter $\nu$ is a half-integer. The proposed algorithm only requires $\mathcal{O}(\nu^3 n)$ operations and $\mathcal{O}(\nu n)$ storage. This leads to a linear-cost solver since $\nu$ is chosen to be fixed and usually very small in most applications. The proposed method can be applied to multi-dimensional problems if a full grid or a sparse grid design is used. The proposed method is based on a novel theory for Mat\'ern correlation functions. We find that a suitable rearrangement of these correlation functions can produce a compactly supported function, called a "kernel packet". Using a set of kernel packets as basis functions leads to a sparse representation of the covariance matrix that results in the proposed algorithm. Simulation studies show that the proposed algorithm, when applicable, is significantly superior to the existing alternatives in both the computational time and predictive accuracy.
    Performance Analysis of Electrical Machines Using a Hybrid Data- and Physics-Driven Model. (arXiv:2201.09603v2 [cs.LG] UPDATED)
    In the design phase of an electrical machine, finite element (FE) simulation are commonly used to numerically optimize the performance. The output of the magneto-static FE simulation characterizes the electromagnetic behavior of the electrical machine. It usually includes intermediate measures such as nonlinear iron losses, electromagnetic torque, and flux values at each operating point to compute the key performance indicators (KPIs). We present a data-driven deep learning approach that replaces the computationally heavy FE calculations by a deep neural network (DNN). The DNN is trained by a large volume of stored FE data in a supervised manner. During the learning process, the network response (intermediate measures) is fed as input to a physics-based post-processing to estimate characteristic maps and KPIs. Results indicate that the predictions of intermediate measures and the subsequent computations of KPIs are close to the ground truth for new machine designs. We show that this hybrid approach yields flexibility in the simulation process. Finally, the proposed hybrid approach is quantitatively compared to existing deep neural network-based direct prediction approach of KPIs.
    Enhancing Column Generation by a Machine-Learning-Based Pricing Heuristic for Graph Coloring. (arXiv:2112.04906v2 [math.OC] UPDATED)
    Column Generation (CG) is an effective method for solving large-scale optimization problems. CG starts by solving a sub-problem with a subset of columns (i.e., variables) and gradually includes new columns that can improve the solution of the current subproblem. The new columns are generated as needed by repeatedly solving a pricing problem, which is often NP-hard and is a bottleneck of the CG approach. To tackle this, we propose a Machine-Learning-based Pricing Heuristic (MLPH)that can generate many high-quality columns efficiently. In each iteration of CG, our MLPH leverages an ML model to predict the optimal solution of the pricing problem, which is then used to guide a sampling method to efficiently generate multiple high-quality columns. Using the graph coloring problem, we empirically show that MLPH significantly enhancesCG as compared to six state-of-the-art methods, and the improvement in CG can lead to substantially better performance of the branch-and-price exact method.
    On the Construction of Distribution-Free Prediction Intervals for an Image Regression Problem in Semiconductor Manufacturing. (arXiv:2203.03150v1 [cs.CV])
    The high-volume manufacturing of the next generation of semiconductor devices requires advances in measurement signal analysis. Many in the semiconductor manufacturing community have reservations about the adoption of deep learning; they instead prefer other model-based approaches for some image regression problems, and according to the 2021 IEEE International Roadmap for Devices and Systems (IRDS) report on Metrology a SEMI standardization committee may endorse this philosophy. The semiconductor manufacturing community does, however, communicate a need for state-of-the-art statistical analyses to reduce measurement uncertainty. Prediction intervals which characterize the reliability of the predictive performance of regression models can impact decisions, build trust in machine learning, and be applied to other regression models. However, we are not aware of effective and sufficiently simple distribution-free approaches that offer valid coverage for important classes of image data, so we consider the distribution-free conformal prediction and conformalized quantile regression framework.The image regression problem that is the focus of this paper pertains to line edge roughness (LER) estimation from noisy scanning electron microscopy images. LER affects semiconductor device performance and reliability as well as the yield of the manufacturing process; the 2021 IRDS emphasizes the crucial importance of LER by devoting a white paper to it in addition to mentioning or discussing it in the reports of multiple international focus teams. It is not immediately apparent how to effectively use normalized conformal prediction and quantile regression for LER estimation. The modeling techniques we apply appear to be novel for finding distribution-free prediction intervals for image data and will be presented at the 2022 SEMI Advanced Semiconductor Manufacturing Conference.
    FedProto: Federated Prototype Learning across Heterogeneous Clients. (arXiv:2105.00243v4 [cs.LG] UPDATED)
    Heterogeneity across clients in federated learning (FL) usually hinders the optimization convergence and generalization performance when the aggregation of clients' knowledge occurs in the gradient space. For example, clients may differ in terms of data distribution, network latency, input/output space, and/or model architecture, which can easily lead to the misalignment of their local gradients. To improve the tolerance to heterogeneity, we propose a novel federated prototype learning (FedProto) framework in which the clients and server communicate the abstract class prototypes instead of the gradients. FedProto aggregates the local prototypes collected from different clients, and then sends the global prototypes back to all clients to regularize the training of local models. The training on each client aims to minimize the classification error on the local data while keeping the resulting local prototypes sufficiently close to the corresponding global ones. Moreover, we provide a theoretical analysis to the convergence rate of FedProto under non-convex objectives. In experiments, we propose a benchmark setting tailored for heterogeneous FL, with FedProto outperforming several recent FL approaches on multiple datasets.
    Scene Editing as Teleoperation: A Case Study in 6DoF Kit Assembly. (arXiv:2110.04450v2 [cs.RO] UPDATED)
    Studies in robot teleoperation have been centered around action specifications -- from continuous joint control to discrete end-effector pose control. However, these robot-centric interfaces often require skilled operators with extensive robotics expertise. To make teleoperation accessible to non-expert users, we propose the framework "Scene Editing as Teleoperation" (SEaT), where the key idea is to transform the traditional "robot-centric" interface into a "scene-centric" interface -- instead of controlling the robot, users focus on specifying the task's goal by manipulating digital twins of the real-world objects. As a result, a user can perform teleoperation without any expert knowledge of the robot hardware. To achieve this goal, we utilize a category-agnostic scene-completion algorithm that translates the real-world workspace (with unknown objects) into a manipulable virtual scene representation and an action-snapping algorithm that refines the user input before generating the robot's action plan. To train the algorithms, we procedurally generated a large-scale, diverse kit-assembly dataset that contains object-kit pairs that mimic real-world object-kitting tasks. Our experiments in simulation and on a real-world system demonstrate that our framework improves both the efficiency and success rate for 6DoF kit-assembly tasks. A user study demonstrates that SEaT framework participants achieve a higher task success rate and report a lower subjective workload compared to an alternative robot-centric interface. Video can be found at https://www.youtube.com/watch?v=-NdR3mkPbQQ .
    Detecting data-driven robust statistical arbitrage strategies with deep neural networks. (arXiv:2203.03179v1 [q-fin.CP])
    We present an approach, based on deep neural networks, that allows identifying robust statistical arbitrage strategies in financial markets. Robust statistical arbitrage strategies refer to self-financing trading strategies that enable profitable trading under model ambiguity. The presented novel methodology does not suffer from the curse of dimensionality nor does it depend on the identification of cointegrated pairs of assets and is therefore applicable even on high-dimensional financial markets or in markets where classical pairs trading approaches fail. Moreover, we provide a method to build an ambiguity set of admissible probability measures that can be derived from observed market data. Thus, the approach can be considered as being model-free and entirely data-driven. We showcase the applicability of our method by providing empirical investigations with highly profitable trading performances even in 50 dimensions, during financial crises, and when the cointegration relationship between asset pairs stops to persist.
    Double-Adversarial Activation Anomaly Detection: Adversarial Autoencoders are Anomaly Generators. (arXiv:2101.04645v3 [cs.LG] UPDATED)
    Anomaly detection is a challenging task for machine learning algorithms due to the inherent class imbalance. It is costly and time-demanding to manually analyse the observed data, thus usually only few known anomalies if any are available. Inspired by generative models and the analysis of the hidden activations of neural networks, we introduce a novel unsupervised anomaly detection method called DA3D. Here, we use adversarial autoencoders to generate anomalous counterexamples based on the normal data only. These artificial anomalies used during training allow the detection of real, yet unseen anomalies. With our novel generative approach, we transform the unsupervised task of anomaly detection to a supervised one, which is more tractable by machine learning and especially deep learning methods. DA3D surpasses the performance of state-of-the-art anomaly detection methods in a purely data-driven way, where no domain knowledge is required.
    Ensemble perspective for understanding temporal credit assignment. (arXiv:2102.03740v2 [cond-mat.dis-nn] UPDATED)
    Recurrent neural networks are widely used for modeling spatio-temporal sequences in both nature language processing and neural population dynamics. However, understanding the temporal credit assignment is hard. Here, we propose that each individual connection in the recurrent computation is modeled by a spike and slab distribution, rather than a precise weight value. We then derive the mean-field algorithm to train the network at the ensemble level. The method is then applied to classify handwritten digits when pixels are read in sequence, and to the multisensory integration task that is a fundamental cognitive function of animals. Our model reveals important connections that determine the overall performance of the network. The model also shows how spatio-temporal information is processed through the hyperparameters of the distribution, and moreover reveals distinct types of emergent neural selectivity. To provide a mechanistic analysis of the ensemble learning, we first derive an analytic solution of the learning at the infinitely-large-network limit. We then carry out a low-dimensional projection of both neural and synaptic dynamics, analyze symmetry breaking in the parameter space, and finally demonstrate the role of stochastic plasticity in the recurrent computation. Therefore, our study sheds light on mechanisms of how weight uncertainty impacts the temporal credit assignment in recurrent neural networks from the ensemble perspective.
    Sparse Bayesian Learning with Diagonal Quasi-Newton Method for Large Scale Classification. (arXiv:2107.08195v3 [cs.LG] UPDATED)
    Sparse Bayesian Learning (SBL) constructs an extremely sparse probabilistic model with very competitive generalization. However, SBL needs to invert a big covariance matrix with complexity O(M^3 ) (M: feature size) for updating the regularization priors, making it difficult for practical use. There are three issues in SBL: 1) Inverting the covariance matrix may obtain singular solutions in some cases, which hinders SBL from convergence; 2) Poor scalability to problems with high dimensional feature space or large data size; 3) SBL easily suffers from memory overflow for large-scale data. This paper addresses these issues with a newly proposed diagonal Quasi-Newton (DQN) method for SBL called DQN-SBL where the inversion of big covariance matrix is ignored so that the complexity and memory storage are reduced to O(M). The DQN-SBL is thoroughly evaluated on non-linear classifiers and linear feature selection using various benchmark datasets of different sizes. Experimental results verify that DQN-SBL receives competitive generalization with a very sparse model and scales well to large-scale problems.
    Temporal-Relational Hypergraph Tri-Attention Networks for Stock Trend Prediction. (arXiv:2107.14033v2 [q-fin.ST] UPDATED)
    Predicting the future price trends of stocks is a challenging yet intriguing problem given its critical role to help investors make profitable decisions. In this paper, we present a collaborative temporal-relational modeling framework for end-to-end stock trend prediction. The temporal dynamics of stocks is firstly captured with an attention-based recurrent neural network. Then, different from existing studies relying on the pairwise correlations between stocks, we argue that stocks are naturally connected as a collective group, and introduce the hypergraph structures to jointly characterize the stock group-wise relationships of industry-belonging and fund-holding. A novel hypergraph tri-attention network (HGTAN) is proposed to augment the hypergraph convolutional networks with a hierarchical organization of intra-hyperedge, inter-hyperedge, and inter-hypergraph attention modules. In this manner, HGTAN adaptively determines the importance of nodes, hyperedges, and hypergraphs during the information propagation among stocks, so that the potential synergies between stock movements can be fully exploited. Extensive experiments on real-world data demonstrate the effectiveness of our approach. Also, the results of investment simulation show that our approach can achieve a more desirable risk-adjusted return. The data and codes of our work have been released at https://github.com/lixiaojieff/HGTAN.
    Virtual vs. Reality: External Validation of COVID-19 Classifiers using XCAT Phantoms for Chest Computed Tomography. (arXiv:2203.03074v1 [eess.IV])
    Research studies of artificial intelligence models in medical imaging have been hampered by poor generalization. This problem has been especially concerning over the last year with numerous applications of deep learning for COVID-19 diagnosis. Virtual imaging trials (VITs) could provide a solution for objective evaluation of these models. In this work utilizing the VITs, we created the CVIT-COVID dataset including 180 virtually imaged computed tomography (CT) images from simulated COVID-19 and normal phantom models under different COVID-19 morphology and imaging properties. We evaluated the performance of an open-source, deep-learning model from the University of Waterloo trained with multi-institutional data and an in-house model trained with the open clinical dataset called MosMed. We further validated the model's performance against open clinical data of 305 CT images to understand virtual vs. real clinical data performance. The open-source model was published with nearly perfect performance on the original Waterloo dataset but showed a consistent performance drop in external testing on another clinical dataset (AUC=0.77) and our simulated CVIT-COVID dataset (AUC=0.55). The in-house model achieved an AUC of 0.87 while testing on the internal test set (MosMed test set). However, performance dropped to an AUC of 0.65 and 0.69 when evaluated on clinical and our simulated CVIT-COVID dataset. The VIT framework offered control over imaging conditions, allowing us to show there was no change in performance as CT exposure was changed from 28.5 to 57 mAs. The VIT framework also provided voxel-level ground truth, revealing that performance of in-house model was much higher at AUC=0.87 for diffuse COVID-19 infection size >2.65% lung volume versus AUC=0.52 for focal disease with <2.65% volume. The virtual imaging framework enabled these uniquely rigorous analyses of model performance.
    STL Robustness Risk over Discrete-Time Stochastic Processes. (arXiv:2104.01503v4 [eess.SY] UPDATED)
    We present a framework to interpret signal temporal logic (STL) formulas over discrete-time stochastic processes in terms of the induced risk. Each realization of a stochastic process either satisfies or violates an STL formula. In fact, we can assign a robustness value to each realization that indicates how robustly this realization satisfies an STL formula. We then define the risk of a stochastic process not satisfying an STL formula robustly, referred to as the STL robustness risk. In our definition, we permit general classes of risk measures such as, but not limited to, the conditional value-at-risk. While in general hard to compute, we propose an approximation of the STL robustness risk. This approximation has the desirable property of being an upper bound of the STL robustness risk when the chosen risk measure is monotone, a property satisfied by most risk measures. Motivated by the interest in data-driven approaches, we present a sampling-based method for estimating the approximate STL robustness risk from data for the value-at-risk. While we consider the value-at-risk, we highlight that such sampling-based methods are viable for other risk measures.
    Evaluation of Interpretability Methods and Perturbation Artifacts in Deep Neural Networks. (arXiv:2203.02928v1 [cs.LG])
    The challenge of interpreting predictions from deep neural networks has prompted the development of numerous interpretability methods. Many of interpretability methods attempt to quantify the importance of input features with respect to the class probabilities, and are called importance estimators or saliency maps. A popular approach to evaluate such interpretability methods is to perturb input features deemed important for predictions and observe the decrease in accuracy. However, perturbation-based evaluation methods may confound the sources of accuracy degradation. We conduct computational experiments that allow to empirically estimate the $\textit{fidelity}$ of interpretability methods and the contribution of perturbation artifacts. All considered importance estimators clearly outperform a random baseline, which contradicts the findings of ROAR [arXiv:1806.10758]. We further compare our results to the crop-and-resize evaluation framework [arXiv:1705.07857], which are largely in agreement. Our study suggests that we can estimate the impact of artifacts and thus empirically evaluate interpretability methods without retraining.
    On the Convergence Rates of Policy Gradient Methods. (arXiv:2201.07443v2 [math.OC] UPDATED)
    We consider infinite-horizon discounted Markov decision problems with finite state and action spaces and study the convergence rates of the projected policy gradient method and a general class of policy mirror descent methods, all with direct parametrization in the policy space. First, we develop a theory of weak gradient-mapping dominance and use it to prove sharper sublinear convergence rate of the projected policy gradient method. Then we show that with geometrically increasing step sizes, a general class of policy mirror descent methods, including the natural policy gradient method and a projected Q-descent method, all enjoy a linear rate of convergence without relying on entropy or other strongly convex regularization. Finally, we also analyze the convergence rate of an inexact policy mirror descent method and estimate its sample complexity under a simple generative model.
    A Review of Emerging Research Directions in Abstract Visual Reasoning. (arXiv:2202.10284v2 [cs.AI] UPDATED)
    Abstract Visual Reasoning (AVR) problems are commonly used to approximate human intelligence. They test the ability of applying previously gained knowledge, experience and skills in a completely new setting, which makes them particularly well-suited for this task. Recently, the AVR problems have become popular as a proxy to study machine intelligence, which has led to emergence of new distinct types of problems and multiple benchmark sets. In this work we review this emerging AVR research and propose a taxonomy to categorise the AVR tasks along 5 dimensions: input shapes, hidden rules, target task, cognitive function, and main challenge. The perspective taken in this survey allows to characterise AVR problems with respect to their shared and distinct properties, provides a unified view on the existing approaches for solving AVR tasks, shows how the AVR problems relate to practical applications, and outlines promising directions for future work. One of them refers to the observation that in the machine learning literature different tasks are considered in isolation, which is in the stark contrast with the way the AVR tasks are used to measure human intelligence, where multiple types of problems are combined within a single IQ test.
    Depthwise Convolution for Multi-Agent Communication with Enhanced Mean-Field Approximation. (arXiv:2203.02896v1 [cs.LG])
    Multi-agent settings remain a fundamental challenge in the reinforcement learning (RL) domain due to the partial observability and the lack of accurate real-time interactions across agents. In this paper, we propose a new method based on local communication learning to tackle the multi-agent RL (MARL) challenge within a large number of agents coexisting. First, we design a new communication protocol that exploits the ability of depthwise convolution to efficiently extract local relations and learn local communication between neighboring agents. To facilitate multi-agent coordination, we explicitly learn the effect of joint actions by taking the policies of neighboring agents as inputs. Second, we introduce the mean-field approximation into our method to reduce the scale of agent interactions. To more effectively coordinate behaviors of neighboring agents, we enhance the mean-field approximation by a supervised policy rectification network (PRN) for rectifying real-time agent interactions and by a learnable compensation term for correcting the approximation bias. The proposed method enables efficient coordination as well as outperforms several baseline approaches on the adaptive traffic signal control (ATSC) task and the StarCraft II multi-agent challenge (SMAC).
    Searching for Robust Neural Architectures via Comprehensive and Reliable Evaluation. (arXiv:2203.03128v1 [cs.LG])
    Neural architecture search (NAS) could help search for robust network architectures, where defining robustness evaluation metrics is the important procedure. However, current robustness evaluations in NAS are not sufficiently comprehensive and reliable. In particular, the common practice only considers adversarial noise and quantified metrics such as the Jacobian matrix, whereas, some studies indicated that the models are also vulnerable to other types of noises such as natural noise. In addition, existing methods taking adversarial noise as the evaluation just use the robust accuracy of the FGSM or PGD, but these adversarial attacks could not provide the adequately reliable evaluation, leading to the vulnerability of the models under stronger attacks. To alleviate the above problems, we propose a novel framework, called Auto Adversarial Attack and Defense (AAAD), where we employ neural architecture search methods, and four types of robustness evaluations are considered, including adversarial noise, natural noise, system noise and quantified metrics, thereby assisting in finding more robust architectures. Also, among the adversarial noise, we use the composite adversarial attack obtained by random search as the new metric to evaluate the robustness of the model architectures. The empirical results on the CIFAR10 dataset show that the searched efficient attack could help find more robust architectures.
    Don't Be So Dense: Sparse-to-Sparse GAN Training Without Sacrificing Performance. (arXiv:2203.02770v1 [cs.CV])
    Generative adversarial networks (GANs) have received an upsurging interest since being proposed due to the high quality of the generated data. While achieving increasingly impressive results, the resource demands associated with the large model size hinders the usage of GANs in resource-limited scenarios. For inference, the existing model compression techniques can reduce the model complexity with comparable performance. However, the training efficiency of GANs has less been explored due to the fragile training process of GANs. In this paper, we, for the first time, explore the possibility of directly training sparse GAN from scratch without involving any dense or pre-training steps. Even more unconventionally, our proposed method enables directly training sparse unbalanced GANs with an extremely sparse generator from scratch. Instead of training full GANs, we start with sparse GANs and dynamically explore the parameter space spanned over the generator throughout training. Such a sparse-to-sparse training procedure enhances the capacity of the highly sparse generator progressively while sticking to a fixed small parameter budget with appealing training and inference efficiency gains. Extensive experiments with modern GAN architectures validate the effectiveness of our method. Our sparsified GANs, trained from scratch in one single run, are able to outperform the ones learned by expensive iterative pruning and re-training. Perhaps most importantly, we find instead of inheriting parameters from expensive pre-trained GANs, directly training sparse GANs from scratch can be a much more efficient solution. For example, only training with a 80% sparse generator and a 70% sparse discriminator, our method can achieve even better performance than the dense BigGAN.
    Node-Variant Graph Filters in Graph Neural Networks. (arXiv:2106.00089v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been successfully employed in a myriad of applications involving graph signals. Theoretical findings establish that GNNs use nonlinear activation functions to create low-eigenvalue frequency content that can be processed in a stable manner by subsequent graph convolutional filters. However, the exact shape of the frequency content created by nonlinear functions is not known and cannot be learned. In this work, we use node-variant graph filters (NVGFs) -- which are linear filters capable of creating frequencies -- as a means of investigating the role that frequency creation plays in GNNs. We show that, by replacing nonlinear activation functions by NVGFs, frequency creation mechanisms can be designed or learned. By doing so, the role of frequency creation is separated from the nonlinear nature of traditional GNNs. Simulations on graph signal processing problems are carried out to pinpoint the role of frequency creation.
    HEAR 2021: Holistic Evaluation of Audio Representations. (arXiv:2203.03022v1 [cs.SD])
    What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It still remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear.
    On the importance of stationarity, strong baselines and benchmarks in transport prediction problems. (arXiv:2203.02954v1 [stat.ML])
    Over the last years, the transportation community has witnessed a tremendous amount of research contributions on new deep learning approaches for spatio-temporal forecasting. These contributions tend to emphasize the modeling of spatial correlations, while neglecting the fairly stable and recurrent nature of human mobility patterns. In this short paper, we show that a naive baseline method based on the average weekly pattern and linear regression can achieve comparable results to many state-of-the-art deep learning approaches for spatio-temporal forecasting in transportation, or even outperform them on several datasets, thus contrasting the importance of stationarity and recurrent patterns in the data with the importance of spatial correlations. Furthermore, we establish 9 different reference benchmarks that can be used to compare new approaches for spatio-temporal forecasting, and provide a discussion on best practices and the direction that the field is taking.
    Detection of Parasitic Eggs from Microscopy Images and the emergence of a new dataset. (arXiv:2203.02940v1 [cs.CV])
    Automatic detection of parasitic eggs in microscopy images has the potential to increase the efficiency of human experts whilst also providing an objective assessment. The time saved by such a process would both help ensure a prompt treatment to patients, and off-load excessive work from experts' shoulders. Advances in deep learning inspired us to exploit successful architectures for detection, adapting them to tackle a different domain. We propose a framework that exploits two such state-of-the-art models. Specifically, we demonstrate results produced by both a Generative Adversarial Network (GAN) and Faster-RCNN, for image enhancement and object detection respectively, on microscopy images of varying quality. The use of these techniques yields encouraging results, though further improvements are still needed for certain egg types whose detection still proves challenging. As a result, a new dataset has been created and made publicly available, providing an even wider range of classes and variability.
    Flurry: a Fast Framework for Reproducible Multi-layered Provenance Graph Representation Learning. (arXiv:2203.02744v1 [cs.CR])
    Complex heterogeneous dynamic networks like knowledge graphs are powerful constructs that can be used in modeling data provenance from computer systems. From a security perspective, these attributed graphs enable causality analysis and tracing for analyzing a myriad of cyberattacks. However, there is a paucity in systematic development of pipelines that transform system executions and provenance into usable graph representations for machine learning tasks. This lack of instrumentation severely inhibits scientific advancement in provenance graph machine learning by hindering reproducibility and limiting the availability of data that are critical for techniques like graph neural networks. To fulfill this need, we present Flurry, an end-to-end data pipeline which simulates cyberattacks, captures provenance data from these attacks at multiple system and application layers, converts audit logs from these attacks into data provenance graphs, and incorporates this data with a framework for training deep neural models that supports preconfigured or custom-designed models for analysis in real-world resilient systems. We showcase this pipeline by processing data from multiple system attacks and performing anomaly detection via graph classification using current benchmark graph representational learning frameworks. Flurry provides a fast, customizable, extensible, and transparent solution for providing this much needed data to cybersecurity professionals.
    The Impact of Differential Privacy on Group Disparity Mitigation. (arXiv:2203.02745v1 [cs.CR])
    The performance cost of differential privacy has, for some applications, been shown to be higher for minority groups; fairness, conversely, has been shown to disproportionally compromise the privacy of members of such groups. Most work in this area has been restricted to computer vision and risk assessment. In this paper, we evaluate the impact of differential privacy on fairness across four tasks, focusing on how attempts to mitigate privacy violations and between-group performance differences interact: Does privacy inhibit attempts to ensure fairness? To this end, we train $(\varepsilon,\delta)$-differentially private models with empirical risk minimization and group distributionally robust training objectives. Consistent with previous findings, we find that differential privacy increases between-group performance differences in the baseline setting; but more interestingly, differential privacy reduces between-group performance differences in the robust setting. We explain this by reinterpreting differential privacy as regularization.
    VGAER: Graph Neural Network Reconstruction based Community Detection. (arXiv:2201.04066v4 [cs.SI] UPDATED)
    Community detection is a fundamental and important issue in network science, but there are only a few community detection algorithms based on graph neural networks, among which unsupervised algorithms are almost blank. By fusing the high-order modularity information with network features, this paper proposes a Variational Graph AutoEncoder Reconstruction based community detection VGAER for the first time, and gives its non-probabilistic version. They do not need any prior information. We have carefully designed corresponding input features, decoder, and downstream tasks based on the community detection task and these designs are concise, natural, and perform well (NMI values under our design are improved by 59.1% - 565.9%). Based on a series of experiments with wide range of datasets and advanced methods, VGAER has achieved superior performance and shows strong competitiveness and potential with a simpler design. Finally, we report the results of algorithm convergence analysis and t-SNE visualization, which clearly depicted the stable performance and powerful network modularity ability of VGAER. Our codes are available at https://github.com/qcydm/VGAER.
    Node Feature Extraction by Self-Supervised Multi-scale Neighborhood Prediction. (arXiv:2111.00064v2 [cs.LG] UPDATED)
    Learning on graphs has attracted significant attention in the learning community due to numerous real-world applications. In particular, graph neural networks (GNNs), which take numerical node features and graph structure as inputs, have been shown to achieve state-of-the-art performance on various graph-related learning tasks. Recent works exploring the correlation between numerical node features and graph structure via self-supervised learning have paved the way for further performance improvements of GNNs. However, methods used for extracting numerical node features from raw data are still graph-agnostic within standard GNN pipelines. This practice is sub-optimal as it prevents one from fully utilizing potential correlations between graph topology and node attributes. To mitigate this issue, we propose a new self-supervised learning framework, Graph Information Aided Node feature exTraction (GIANT). GIANT makes use of the eXtreme Multi-label Classification (XMC) formalism, which is crucial for fine-tuning the language model based on graph information, and scales to large datasets. We also provide a theoretical analysis that justifies the use of XMC over link prediction and motivates integrating XR-Transformers, a powerful method for solving XMC problems, into the GIANT framework. We demonstrate the superior performance of GIANT over the standard GNN pipeline on Open Graph Benchmark datasets: For example, we improve the accuracy of the top-ranked method GAMLP from $68.25\%$ to $69.67\%$, SGC from $63.29\%$ to $66.10\%$ and MLP from $47.24\%$ to $61.10\%$ on the ogbn-papers100M dataset by leveraging GIANT.
    HintNet: Hierarchical Knowledge Transfer Networks for Traffic Accident Forecasting on Heterogeneous Spatio-Temporal Data. (arXiv:2203.03100v1 [cs.LG])
    Traffic accident forecasting is a significant problem for transportation management and public safety. However, this problem is challenging due to the spatial heterogeneity of the environment and the sparsity of accidents in space and time. The occurrence of traffic accidents is affected by complex dependencies among spatial and temporal features. Recent traffic accident prediction methods have attempted to use deep learning models to improve accuracy. However, most of these methods either focus on small-scale and homogeneous areas such as populous cities or simply use sliding-window-based ensemble methods, which are inadequate to handle heterogeneity in large regions. To address these limitations, this paper proposes a novel Hierarchical Knowledge Transfer Network (HintNet) model to better capture irregular heterogeneity patterns. HintNet performs a multi-level spatial partitioning to separate sub-regions with different risks and learns a deep network model for each level using spatio-temporal and graph convolutions. Through knowledge transfer across levels, HintNet archives both higher accuracy and higher training efficiency. Extensive experiments on a real-world accident dataset from the state of Iowa demonstrate that HintNet outperforms the state-of-the-art methods on spatially heterogeneous and large-scale areas.
    MURANA: A Generic Framework for Stochastic Variance-Reduced Optimization. (arXiv:2106.03056v2 [math.OC] UPDATED)
    We propose a generic variance-reduced algorithm, which we call MUltiple RANdomized Algorithm (MURANA), for minimizing a sum of several smooth functions plus a regularizer, in a sequential or distributed manner. Our method is formulated with general stochastic operators, which allow us to model various strategies for reducing the computational complexity. For example, MURANA supports sparse activation of the gradients, and also reduction of the communication load via compression of the update vectors. This versatility allows MURANA to cover many existing randomization mechanisms within a unified framework. However, MURANA also encodes new methods as special cases. We highlight one of them, which we call ELVIRA, and show that it improves upon Loopless SVRG.
    HAR-GCNN: Deep Graph CNNs for Human Activity Recognition From Highly Unlabeled Mobile Sensor Data. (arXiv:2203.03087v1 [cs.CV])
    The problem of human activity recognition from mobile sensor data applies to multiple domains, such as health monitoring, personal fitness, daily life logging, and senior care. A critical challenge for training human activity recognition models is data quality. Acquiring balanced datasets containing accurate activity labels requires humans to correctly annotate and potentially interfere with the subjects' normal activities in real-time. Despite the likelihood of incorrect annotation or lack thereof, there is often an inherent chronology to human behavior. For example, we take a shower after we exercise. This implicit chronology can be used to learn unknown labels and classify future activities. In this work, we propose HAR-GCCN, a deep graph CNN model that leverages the correlation between chronologically adjacent sensor measurements to predict the correct labels for unclassified activities that have at least one activity label. We propose a new training strategy enforcing that the model predicts the missing activity labels by leveraging the known ones. HAR-GCCN shows superior performance relative to previously used baseline methods, improving classification accuracy by about 25% and up to 68% on different datasets. Code is available at \url{https://github.com/abduallahmohamed/HAR-GCNN}.
    GPU Semiring Primitives for Sparse Neighborhood Methods. (arXiv:2104.06357v2 [cs.LG] UPDATED)
    High-performance primitives for mathematical operations on sparse vectors must deal with the challenges of skewed degree distributions and limits on memory consumption that are typically not issues in dense operations. We demonstrate that a sparse semiring primitive can be flexible enough to support a wide range of critical distance measures while maintaining performance and memory efficiency on the GPU. We further show that this primitive is a foundational component for enabling many neighborhood-based information retrieval and machine learning algorithms to accept sparse input. To our knowledge, this is the first work aiming to unify the computation of several critical distance measures on the GPU under a single flexible design paradigm and we hope that it provides a good baseline for future research in this area. Our implementation is fully open source and publicly available as part of the RAFT library of GPU-accelerated machine learning primitives (https://github.com/rapidsai/raft).
    A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits. (arXiv:2201.06532v2 [cs.LG] UPDATED)
    We study the non-stationary stochastic multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of their dynamic regret, which is defined as the difference between the expected cumulative reward of an agent choosing the optimal arm in every time step and the cumulative reward of the learning algorithm. One way to measure the hardness of such environments is to consider how many times the identity of the optimal arm can change. We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde O(\sqrt{K N(S+1)})$ dynamic regret, where $N$ is the time horizon of the problem and $S$ is the number of times the identity of the optimal arm changes, without prior knowledge of $S$. Previous works for this problem obtain regret bounds that scale with the number of changes (or the amount of change) in the reward functions, which can be much larger, or assume prior knowledge of $S$ to achieve similar bounds.
    SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training. (arXiv:2201.10207v3 [eess.AS] UPDATED)
    We introduce a new approach for speech pre-training named SPIRAL which works by learning denoising representation of perturbed data in a teacher-student framework. Specifically, given a speech utterance, we first feed the utterance to a teacher network to obtain corresponding representation. Then the same utterance is perturbed and fed to a student network. The student network is trained to output representation resembling that of the teacher. At the same time, the teacher network is updated as moving average of student's weights over training steps. In order to prevent representation collapse, we apply an in-utterance contrastive loss as pre-training objective and impose position randomization on the input to the teacher. SPIRAL achieves competitive or better results compared to state-of-the-art speech pre-training method wav2vec 2.0, with significant reduction of training cost (80% for BASE model, 65% for LARGE model). Furthermore, we address the problem of noise-robustness that is critical to real-world speech applications. We propose multi-condition pre-training by perturbing the student's input with various types of additive noise. We demonstrate that multi-condition pre-trained SPIRAL models are more robust to noisy speech (9.0% - 13.3% relative word error rate reduction on real noisy test data), compared to applying multi-condition training solely in the fine-tuning stage. Source code is available at https://github.com/huawei-noah/Speech-Backbones/tree/main/SPIRAL.
    ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions. (arXiv:2201.00382v2 [cs.LG] UPDATED)
    Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
    Regularization in network optimization via trimmed stochastic gradient descent with noisy label. (arXiv:2012.11073v2 [cs.LG] UPDATED)
    Regularization is essential for avoiding over-fitting to training data in network optimization, leading to better generalization of the trained networks. The label noise provides a strong implicit regularization by replacing the target ground truth labels of training examples by uniform random labels. However, it can cause undesirable misleading gradients due to the large loss associated with incorrect labels. We propose a first-order optimization method (Label-Noised Trim-SGD) that uses the label noise with the example trimming in order to remove the outliers based on the loss. The proposed algorithm is simple yet enables us to impose a large label-noise and obtain a better regularization effect than the original methods. The quantitative analysis is performed by comparing the behavior of the label noise, the example trimming, and the proposed algorithm. We also present empirical results that demonstrate the effectiveness of our algorithm using the major benchmarks and the fundamental networks, where our method has successfully outperformed the state-of-the-art optimization methods.
    TSGB: Target-Selective Gradient Backprop for Probing CNN Visual Saliency. (arXiv:2110.05182v2 [cs.CV] UPDATED)
    The explanation for deep neural networks has drawn extensive attention in the deep learning community over the past few years. In this work, we study the visual saliency, a.k.a. visual explanation, to interpret convolutional neural networks. Compared to iteration based saliency methods, single backward pass based saliency methods benefit from faster speed, and they are widely used in downstream visual tasks. Thus, we focus on single backward pass based methods. However, existing methods in this category struggle to uccessfully produce fine-grained saliency maps concentrating on specific target classes. That said, producing faithful saliency maps satisfying both target-selectiveness and fine-grainedness using a single backward pass is a challenging problem in the field. To mitigate this problem, we revisit the gradient flow inside the network, and find that the entangled semantics and original weights may disturb the propagation of target-relevant saliency. Inspired by those observations, we propose a novel visual saliency method, termed Target-Selective Gradient Backprop (TSGB), which leverages rectification operations to effectively emphasize target classes and further efficiently propagate the saliency to the image space, thereby generating target-selective and fine-grained saliency maps. The proposed TSGB consists of two components, namely, TSGB-Conv and TSGB-FC, which rectify the gradients for convolutional layers and fully-connected layers, respectively. Extensive qualitative and quantitative experiments on the ImageNet and Pascal VOC datasets show that the proposed method achieves more accurate and reliable results than the other competitive methods. Code is available at https://github.com/123fxdx/CNNvisualizationTSGB.
    A Robust Spectral Algorithm for Overcomplete Tensor Decomposition. (arXiv:2203.02790v1 [cs.LG])
    We give a spectral algorithm for decomposing overcomplete order-4 tensors, so long as their components satisfy an algebraic non-degeneracy condition that holds for nearly all (all but an algebraic set of measure $0$) tensors over $(\mathbb{R}^d)^{\otimes 4}$ with rank $n \le d^2$. Our algorithm is robust to adversarial perturbations of bounded spectral norm. Our algorithm is inspired by one which uses the sum-of-squares semidefinite programming hierarchy (Ma, Shi, and Steurer STOC'16, arXiv:1610.01980), and we achieve comparable robustness and overcompleteness guarantees under similar algebraic assumptions. However, our algorithm avoids semidefinite programming and may be implemented as a series of basic linear-algebraic operations. We consequently obtain a much faster running time than semidefinite programming methods: our algorithm runs in time $\tilde O(n^2d^3) \le \tilde O(d^7)$, which is subquadratic in the input size $d^4$ (where we have suppressed factors related to the condition number of the input tensor).
    On Feature Collapse and Deep Kernel Learning for Single Forward Pass Uncertainty. (arXiv:2102.11409v3 [cs.LG] UPDATED)
    Inducing point Gaussian process approximations are often considered a gold standard in uncertainty estimation since they retain many of the properties of the exact GP and scale to large datasets. A major drawback is that they have difficulty scaling to high dimensional inputs. Deep Kernel Learning (DKL) promises a solution: a deep feature extractor transforms the inputs over which an inducing point Gaussian process is defined. However, DKL has been shown to provide unreliable uncertainty estimates in practice. We study why, and show that with no constraints, the DKL objective pushes "far-away" data points to be mapped to the same features as those of training-set points. With this insight we propose to constrain DKL's feature extractor to approximately preserve distances through a bi-Lipschitz constraint, resulting in a feature space favorable to DKL. We obtain a model, DUE, which demonstrates uncertainty quality outperforming previous DKL and other single forward pass uncertainty methods, while maintaining the speed and accuracy of standard neural networks.
    Bathymetry Inversion using a Deep-Learning-Based Surrogate for Shallow Water Equations Solvers. (arXiv:2203.02821v1 [physics.flu-dyn])
    River bathymetry is critical for many aspects of water resources management. We propose and demonstrate a bathymetry inversion method using a deep-learning-based surrogate for shallow water equations solvers. The surrogate uses the convolutional autoencoder with a shared-encoder, separate-decoder architecture. It encodes the input bathymetry and decodes to separate outputs for flow-field variables. A gradient-based optimizer is used to perform bathymetry inversion with the trained surrogate. Two physically-based constraints on both bed elevation value and slope have to be added as inversion loss regularizations to obtain usable inversion results. Using the "L-curve" criterion, a heuristic approach was proposed to determine the regularization parameters. Both the surrogate model and the inversion algorithm show good performance. We found the bathymetry inversion process has two distinctive stages, which resembles the sculptural process of initial broad-brush calving and final detailing. The inversion loss due to flow prediction error reaches its minimum in the first stage and remains almost constant afterward. The bed elevation value and slope regularizations play the dominant role in the second stage in selecting the most probable solution. We also found the surrogate architecture (whether with both velocity and water surface elevation or velocity only as outputs) does not show significant impact on inversion result.
    Equal Bits: Enforcing Equally Distributed Binary Network Weights. (arXiv:2112.03406v2 [cs.LG] UPDATED)
    Binary networks are extremely efficient as they use only two symbols to define the network: $\{+1,-1\}$. One can make the prior distribution of these symbols a design choice. The recent IR-Net of Qin et al. argues that imposing a Bernoulli distribution with equal priors (equal bit ratios) over the binary weights leads to maximum entropy and thus minimizes information loss. However, prior work cannot precisely control the binary weight distribution during training, and therefore cannot guarantee maximum entropy. Here, we show that quantizing using optimal transport can guarantee any bit ratio, including equal ratios. We investigate experimentally that equal bit ratios are indeed preferable and show that our method leads to optimization benefits. We show that our quantization method is effective when compared to state-of-the-art binarization methods, even when using binary weight pruning.
    Using Deep Learning to Bootstrap Abstractions for Hierarchical Robot Planning. (arXiv:2202.00907v3 [cs.RO] UPDATED)
    This paper addresses the problem of learning abstractions that boost robot planning performance while providing strong guarantees of reliability. Although state-of-the-art hierarchical robot planning algorithms allow robots to efficiently compute long-horizon motion plans for achieving user desired tasks, these methods typically rely upon environment-dependent state and action abstractions that need to be hand-designed by experts. We present a new approach for bootstrapping the entire hierarchical planning process. This allows us to compute abstract states and actions for new environments automatically using the critical regions predicted by a deep neural network with an auto-generated robot-specific architecture. We show that the learned abstractions can be used with a novel multi-source bi-directional hierarchical robot planning algorithm that is sound and probabilistically complete. An extensive empirical evaluation on twenty different settings using holonomic and non-holonomic robots shows that (a) our learned abstractions provide the information necessary for efficient multi-source hierarchical planning; and that (b) this approach of learning, abstractions, and planning outperforms state-of-the-art baselines by nearly a factor of ten in terms of planning time on test environments not seen during training.
    Towards a Responsible AI Development Lifecycle: Lessons From Information Security. (arXiv:2203.02958v1 [cs.AI])
    Legislation and public sentiment throughout the world have promoted fairness metrics, explainability, and interpretability as prescriptions for the responsible development of ethical artificial intelligence systems. Despite the importance of these three pillars in the foundation of the field, they can be challenging to operationalize and attempts to solve the problems in production environments often feel Sisyphean. This difficulty stems from a number of factors: fairness metrics are computationally difficult to incorporate into training and rarely alleviate all of the harms perpetrated by these systems. Interpretability and explainability can be gamed to appear fair, may inadvertently reduce the privacy of personal information contained in training data, and increase user confidence in predictions -- even when the explanations are wrong. In this work, we propose a framework for responsibly developing artificial intelligence systems by incorporating lessons from the field of information security and the secure development lifecycle to overcome challenges associated with protecting users in adversarial settings. In particular, we propose leveraging the concepts of threat modeling, design review, penetration testing, and incident response in the context of developing AI systems as ways to resolve shortcomings in the aforementioned methods.
    Early Stopping for Deep Image Prior. (arXiv:2112.06074v2 [cs.CV] UPDATED)
    Deep image prior (DIP) and its variants have showed remarkable potential for solving inverse problems in computer vision, without any extra training data. Practical DIP models are often substantially overparameterized. During the fitting process, these models learn mostly the desired visual content first, and then pick up the potential modeling and observational noise, i.e., overfitting. Thus, the practicality of DIP often depends critically on good early stopping (ES) that captures the transition period. In this regard, the majority of DIP works for vision tasks only demonstrates the potential of the models -- reporting the peak performance against the ground truth, but provides no clue about how to operationally obtain near-peak performance without access to the groundtruth. In this paper, we set to break this practicality barrier of DIP, and propose an efficient ES strategy, which consistently detects near-peak performance across several vision tasks and DIP variants. Based on a simple measure of dispersion of consecutive DIP reconstructions, our ES method not only outpaces the existing ones -- which only work in very narrow domains, but also remains effective when combined with a number of methods that try to mitigate the overfitting. The code is available at https://github.com/sun-umn/Early_Stopping_for_DIP.
    Graph Neural Network Potential for Magnetic Materials. (arXiv:2203.02853v1 [physics.comp-ph])
    Machine Learning (ML) interatomic potential has shown its great power in condensed matter physics. However, ML interatomic potential for a magnetic system including both structural degrees of freedom and magnetic moments has not been well developed yet. A spin-dependent ML interatomic potential approach based on the crystal graph neural network (GNN) has been developed for any magnetic system. It consists of the Heisenberg edge graph neural network (HEGNN) and spin-distance edge graph neural network (SEGNN). The network captures the Heisenberg coefficient variation between different structures and the fine spin-lattice coupling of high order and multi-body interaction with high accuracy. In the tests, this method perfectly fitted a high-order spin Hamiltonian and two complex spin-lattice Hamiltonian and captured the fine spin-lattice coupling in BiFeO3. In addition, a disturbed structure of BiFeO3 with strain was successfully optimized with the trained potential. Our work has expanded the powerful ML GNN potentials to magnetic systems, which paves a new way for large-scale dynamic simulations on spin-lattice coupled systems.
    Adaptive Early-Learning Correction for Segmentation from Noisy Annotations. (arXiv:2110.03740v2 [cs.CV] UPDATED)
    Deep learning in the presence of noisy annotations has been studied extensively in classification, but much less in segmentation tasks. In this work, we study the learning dynamics of deep segmentation networks trained on inaccurately-annotated data. We discover a phenomenon that has been previously reported in the context of classification: the networks tend to first fit the clean pixel-level labels during an "early-learning" phase, before eventually memorizing the false annotations. However, in contrast to classification, memorization in segmentation does not arise simultaneously for all semantic categories. Inspired by these findings, we propose a new method for segmentation from noisy annotations with two key elements. First, we detect the beginning of the memorization phase separately for each category during training. This allows us to adaptively correct the noisy annotations in order to exploit early learning. Second, we incorporate a regularization term that enforces consistency across scales to boost robustness against annotation noise. Our method outperforms standard approaches on a medical-imaging segmentation task where noises are synthesized to mimic human annotation errors. It also provides robustness to realistic noisy annotations present in weakly-supervised semantic segmentation, achieving state-of-the-art results on PASCAL VOC 2012. Code is available at https://github.com/Kangningthu/ADELE
    A Novel Dual Dense Connection Network for Video Super-resolution. (arXiv:2203.02723v1 [eess.IV])
    Video super-resolution (VSR) refers to the reconstruction of high-resolution (HR) video from the corresponding low-resolution (LR) video. Recently, VSR has received increasing attention. In this paper, we propose a novel dual dense connection network that can generate high-quality super-resolution (SR) results. The input frames are creatively divided into reference frame, pre-temporal group and post-temporal group, representing information in different time periods. This grouping method provides accurate information of different time periods without causing time information disorder. Meanwhile, we produce a new loss function, which is beneficial to enhance the convergence ability of the model. Experiments show that our model is superior to other advanced models in Vid4 datasets and SPMCS-11 datasets.
    Can You Hear It? Backdoor Attacks via Ultrasonic Triggers. (arXiv:2107.14569v3 [cs.CR] UPDATED)
    This work explores backdoor attacks for automatic speech recognition systems where we inject inaudible triggers. By doing so, we make the backdoor attack challenging to detect for legitimate users, and thus, potentially more dangerous. We conduct experiments on two versions of a speech dataset and three neural networks and explore the performance of our attack concerning the duration, position, and type of the trigger. Our results indicate that less than 1% of poisoned data is sufficient to deploy a backdoor attack and reach a 100% attack success rate. We observed that short, non-continuous triggers result in highly successful attacks. However, since our trigger is inaudible, it can be as long as possible without raising any suspicions making the attack more effective. Finally, we conducted our attack in actual hardware and saw that an adversary could manipulate inference in an Android application by playing the inaudible trigger over the air.
    Uncertify: Attacks Against Neural Network Certification. (arXiv:2108.11299v2 [cs.LG] UPDATED)
    Certifiers for neural networks have made great progress towards provable robustness guarantees against evasion attacks using adversarial examples. However, introducing certifiers into deep learning systems also opens up new attack vectors, which need to be considered before deployment. In this work, we conduct the first systematic analysis of training-time attacks against certifiers in practical application pipelines, identifying new threat vectors that can be exploited to degrade the overall system. Using these insights, we design two backdoor attacks against network certifiers, which can drastically reduce certified robustness. For example, adding 1% poisoned data points during training is sufficient to reduce certified robustness by up to 95 percentage points, effectively rendering the certifier useless. We analyze how such novel attacks can compromise the overall system's integrity or availability. Our extensive experiments across multiple datasets, model architectures, and certifiers demonstrate the wide applicability of these attacks. A first investigation into potential defenses shows that current approaches are insufficient to mitigate the issue, highlighting the need for new, more specific solutions.
    A Small Gain Analysis of Single Timescale Actor Critic. (arXiv:2203.02591v1 [math.OC])
    We consider a version of actor-critic which uses proportional step-sizes and only one critic update with a single sample from the stationary distribution per actor step. We provide an analysis of this method using the small gain theorem. Specifically, we prove that this method can be used to find a stationary point, and that the resulting sample complexity improves the state of the art for actor-critic methods from $O \left(\mu^{-4} \epsilon^{-2} \right)$ to $O \left(\mu^{-2} \epsilon^{-2} \right)$ to find an $\epsilon$-approximate stationary point where $\mu$ is the condition number associated with the critic.
    Fully Decentralized, Scalable Gaussian Processes for Multi-Agent Federated Learning. (arXiv:2203.02865v1 [stat.ML])
    In this paper, we propose decentralized and scalable algorithms for Gaussian process (GP) training and prediction in multi-agent systems. To decentralize the implementation of GP training optimization algorithms, we employ the alternating direction method of multipliers (ADMM). A closed-form solution of the decentralized proximal ADMM is provided for the case of GP hyper-parameter training with maximum likelihood estimation. Multiple aggregation techniques for GP prediction are decentralized with the use of iterative and consensus methods. In addition, we propose a covariance-based nearest neighbor selection strategy that enables a subset of agents to perform predictions. The efficacy of the proposed methods is illustrated with numerical experiments on synthetic and real data.
    Learning Latent Graph Dynamics for Visual Manipulation of Deformable Objects. (arXiv:2104.12149v2 [cs.RO] UPDATED)
    Manipulating deformable objects, such as ropes and clothing, is a long-standing challenge in robotics, because of their large degrees of freedom, complex non-linear dynamics, and self-occlusion in visual perception. The key difficulty is a suitable representation, rich enough to capture the object shape, dynamics for manipulation and yet simple enough to be estimated reliably from visual observations. This work aims to learn latent Graph dynamics for DefOrmable Object Manipulation (G-DOOM). G-DOOM approximates a deformable object as a sparse set of interacting keypoints, which are extracted automatically from images via unsupervised learning. It learns a graph neural network that captures abstractly the geometry and the interaction dynamics of the keypoints. To handle object self-occlusion, G-DOOM uses a recurrent neural network to track the keypoints over time and condition their interactions on the history. We then train the resulting recurrent graph dynamics model through contrastive learning in a high-fidelity simulator. For manipulation planning, G-DOOM reasons explicitly about the learned dynamics model through model-predictive control applied at each keypoint. Preliminary experiments of G-DOOM on a set of challenging rope and cloth manipulation tasks indicate strong performance, compared with state-of-the-art methods. Although trained in a simulator, G-DOOM transfers directly to a real robot for both rope and cloth manipulation.
    Dual Space Graph Contrastive Learning. (arXiv:2201.07409v2 [cs.LG] UPDATED)
    Unsupervised graph representation learning has emerged as a powerful tool to address real-world problems and achieves huge success in the graph learning domain. Graph contrastive learning is one of the unsupervised graph representation learning methods, which recently attracts attention from researchers and has achieved state-of-the-art performances on various tasks. The key to the success of graph contrastive learning is to construct proper contrasting pairs to acquire the underlying structural semantics of the graph. However, this key part is not fully explored currently, most of the ways generating contrasting pairs focus on augmenting or perturbating graph structures to obtain different views of the input graph. But such strategies could degrade the performances via adding noise into the graph, which may narrow down the field of the applications of graph contrastive learning. In this paper, we propose a novel graph contrastive learning method, namely \textbf{D}ual \textbf{S}pace \textbf{G}raph \textbf{C}ontrastive (DSGC) Learning, to conduct graph contrastive learning among views generated in different spaces including the hyperbolic space and the Euclidean space. Since both spaces have their own advantages to represent graph data in the embedding spaces, we hope to utilize graph contrastive learning to bridge the spaces and leverage advantages from both sides. The comparison experiment results show that DSGC achieves competitive or better performances among all the datasets. In addition, we conduct extensive experiments to analyze the impact of different graph encoders on DSGC, giving insights about how to better leverage the advantages of contrastive learning between different spaces.
    P3GM: Private High-Dimensional Data Release via Privacy Preserving Phased Generative Model. (arXiv:2006.12101v4 [cs.LG] UPDATED)
    How can we release a massive volume of sensitive data while mitigating privacy risks? Privacy-preserving data synthesis enables the data holder to outsource analytical tasks to an untrusted third party. The state-of-the-art approach for this problem is to build a generative model under differential privacy, which offers a rigorous privacy guarantee. However, the existing method cannot adequately handle high dimensional data. In particular, when the input dataset contains a large number of features, the existing techniques require injecting a prohibitive amount of noise to satisfy differential privacy, which results in the outsourced data analysis meaningless. To address the above issue, this paper proposes privacy-preserving phased generative model (P3GM), which is a differentially private generative model for releasing such sensitive data. P3GM employs the two-phase learning process to make it robust against the noise, and to increase learning efficiency (e.g., easy to converge). We give theoretical analyses about the learning complexity and privacy loss in P3GM. We further experimentally evaluate our proposed method and demonstrate that P3GM significantly outperforms existing solutions. Compared with the state-of-the-art methods, our generated samples look fewer noises and closer to the original data in terms of data diversity. Besides, in several data mining tasks with synthesized data, our model outperforms the competitors in terms of accuracy.
    Generative Modeling with Optimal Transport Maps. (arXiv:2110.02999v2 [cs.LG] UPDATED)
    With the discovery of Wasserstein GANs, Optimal Transport (OT) has become a powerful tool for large-scale generative modeling tasks. In these tasks, OT cost is typically used as the loss for training GANs. In contrast to this approach, we show that the OT map itself can be used as a generative model, providing comparable performance. Previous analogous approaches consider OT maps as generative models only in the latent spaces due to their poor performance in the original high-dimensional ambient space. In contrast, we apply OT maps directly in the ambient space, e.g., a space of high-dimensional images. First, we derive a min-max optimization algorithm to efficiently compute OT maps for the quadratic cost (Wasserstein-2 distance). Next, we extend the approach to the case when the input and output distributions are located in the spaces of different dimensions and derive error bounds for the computed OT map. We evaluate the algorithm on image generation and unpaired image restoration tasks. In particular, we consider denoising, colorization, and inpainting, where the optimality of the restoration map is a desired attribute, since the output (restored) image is expected to be close to the input (degraded) one.
    A Perspective on Robotic Telepresence and Teleoperation using Cognition: Are we there yet?. (arXiv:2203.02959v1 [cs.RO])
    Telepresence and teleoperation robotics have attracted a great amount of attention in the last 10 years. With the Artificial Intelligence (AI) revolution already being started, we can see a wide range of robotic applications being realized. Intelligent robotic systems are being deployed both in industrial and domestic environments. Telepresence is the idea of being present in a remote location virtually or via robotic avatars. Similarly, the idea of operating a robot from a remote location for various tasks is called teleoperation. These technologies find significant application in health care, education, surveillance, disaster recovery, and corporate/government sectors. But question still remains about their maturity, security and safety levels. We also need to think about enhancing the user experience and trust in such technologies going into the next generation of computing.
    Who will Leave a Pediatric Weight Management Program and When? -- A machine learning approach for predicting attrition patterns. (arXiv:2202.01765v3 [cs.LG] UPDATED)
    Childhood obesity is a major public health concern. Multidisciplinary pediatric weight management programs are considered standard treatment for children with obesity and severe obesity who are not able to be successfully managed in the primary care setting; however, high drop-out rates (referred to as attrition) are a major hurdle in delivering successful interventions. Predicting attrition patterns can help providers reduce the attrition rates. Previous work has mainly focused on finding static predictors of attrition using statistical analysis methods. In this study, we present a machine learning model to predict (a) the likelihood of attrition, and (b) the change in body-mass index (BMI) percentile of children, at different time points after joining a weight management program. We use a five-year dataset containing the information related to around 4,550 children that we have compiled using data from the Nemours Pediatric Weight Management program. Our models show strong prediction performance as determined by high AUROC scores across different tasks (average AUROC of 0.75 for predicting attrition, and 0.73 for predicting weight outcomes). Additionally, we report the top features predicting attrition and weight outcomes in a series of explanatory experiments.
    The Grammar-Learning Trajectories of Neural Language Models. (arXiv:2109.06096v2 [cs.CL] UPDATED)
    The learning trajectories of linguistic phenomena in humans provide insight into linguistic representation, beyond what can be gleaned from inspecting the behavior of an adult speaker. To apply a similar approach to analyze neural language models (NLM), it is first necessary to establish that different models are similar enough in the generalizations they make. In this paper, we show that NLMs with different initialization, architecture, and training data acquire linguistic phenomena in a similar order, despite their different end performance. These findings suggest that there is some mutual inductive bias that underlies these models' learning of linguistic phenomena. Taking inspiration from psycholinguistics, we argue that studying this inductive bias is an opportunity to study the linguistic representation implicit in NLMs. Leveraging these findings, we compare the relative performance on different phenomena at varying learning stages with simpler reference models. Results suggest that NLMs exhibit consistent "developmental" stages. Moreover, we find the learning trajectory to be approximately one-dimensional: given an NLM with a certain overall performance, it is possible to predict what linguistic generalizations it has already acquired. Initial analysis of these stages presents phenomena clusters (notably morphological ones), whose performance progresses in unison, suggesting a potential link between the generalizations behind them.
    Efficiently Modeling Long Sequences with Structured State Spaces. (arXiv:2111.00396v2 [cs.LG] UPDATED)
    A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \( x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) \), and showed that for appropriate choices of the state matrix \( A \), this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning \( A \) with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91\% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation $60\times$ faster (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.
    LocoProp: Enhancing BackProp via Local Loss Optimization. (arXiv:2106.06199v2 [cs.LG] UPDATED)
    Second-order methods have shown state-of-the-art performance for optimizing deep neural networks. Nonetheless, their large memory requirement and high computational complexity, compared to first-order methods, hinder their versatility in a typical low-budget setup. This paper introduces a general framework of layerwise loss construction for multilayer neural networks that achieves a performance closer to second-order methods while utilizing first-order optimizers only. Our methodology lies upon a three-component loss, target, and regularizer combination, for which altering each component results in a new update rule. We provide examples using squared loss and layerwise Bregman divergences induced by the convex integral functions of various transfer functions. Our experiments on benchmark models and datasets validate the efficacy of our new approach, reducing the gap between first-order and second-order optimizers.
    Infinitely Divisible Noise in the Low Privacy Regime. (arXiv:2110.06559v3 [cs.LG] UPDATED)
    Federated learning, in which training data is distributed among users and never shared, has emerged as a popular approach to privacy-preserving machine learning. Cryptographic techniques such as secure aggregation are used to aggregate contributions, like a model update, from all users. A robust technique for making such aggregates differentially private is to exploit infinite divisibility of the Laplace distribution, namely, that a Laplace distribution can be expressed as a sum of i.i.d. noise shares from a Gamma distribution, one share added by each user. However, Laplace noise is known to have suboptimal error in the low privacy regime for $\varepsilon$-differential privacy, where $\varepsilon > 1$ is a large constant. In this paper we present the first infinitely divisible noise distribution for real-valued data that achieves $\varepsilon$-differential privacy and has expected error that decreases exponentially with $\varepsilon$.
    ILDAE: Instance-Level Difficulty Analysis of Evaluation Data. (arXiv:2203.03073v1 [cs.CL])
    Knowledge of questions' difficulty level helps a teacher in several ways, such as estimating students' potential quickly by asking carefully selected questions and improving quality of examination by modifying trivial and hard questions. Can we extract such benefits of instance difficulty in NLP? To this end, we conduct Instance-Level Difficulty Analysis of Evaluation data (ILDAE) in a large-scale setup of 23 datasets and demonstrate its five novel applications: 1) conducting efficient-yet-accurate evaluations with fewer instances saving computational cost and time, 2) improving quality of existing evaluation datasets by repairing erroneous and trivial instances, 3) selecting the best model based on application requirements, 4) analyzing dataset characteristics for guiding future data creation, 5) estimating Out-of-Domain performance reliably. Comprehensive experiments for these applications result in several interesting findings, such as evaluation using just 5% instances (selected via ILDAE) achieves as high as 0.93 Kendall correlation with evaluation using complete dataset and computing weighted accuracy using difficulty scores leads to 5.2% higher correlation with Out-of-Domain performance. We release the difficulty scores and hope our analyses and findings will bring more attention to this important yet understudied field of leveraging instance difficulty in evaluations.
    The application of adaptive minimum match k-nearest neighbors to identify at-risk students in health professions education. (arXiv:2108.07709v2 [cs.CY] UPDATED)
    Purpose: When a learner fails to reach a milestone, educators often wonder if there had been any warning signs that could have allowed them to intervene sooner. Machine learning can predict which students are at risk of failing a high-stakes certification exam. If predictions can be made well in advance of the exam, then educators can meaningfully intervene before students take the exam to reduce the chances of a failing score. Method: Using already-collected, first-year student assessment data from five cohorts in a Master of Physician Assistant Studies program, the authors implement an "adaptive minimum match" version of the k-nearest neighbors algorithm (AMMKNN), using changing numbers of neighbors to predict each student's future exam scores on the Physician Assistant National Certifying Examination (PANCE). Validation occurred in two ways: Leave-one-out cross-validation (LOOCV) and evaluating the predictions in a new cohort. Results: AMMKNN achieved an accuracy of 93% in LOOCV. AMMKNN generates a predicted PANCE score for each student, one year before they are scheduled to take the exam. Students can then be classified into extra support, optional extra support, or no extra support groups. The educator then has one year to provide the appropriate customized support to each category of student. Conclusions: Predictive analytics can identify at-risk students, so they can receive additional support or remediation when preparing for high-stakes certification exams. Educators can use the included methods and code to generate predicted test outcomes for students. The authors recommend that educators use this or similar predictive methods responsibly and transparently, as one of many tools used to support students.
    Predicting Flat-Fading Channels via Meta-Learned Closed-Form Linear Filters and Equilibrium Propagation. (arXiv:2110.00414v2 [cs.IT] UPDATED)
    Predicting fading channels is a classical problem with a vast array of applications, including as an enabler of artificial intelligence (AI)-based proactive resource allocation for cellular networks. Under the assumption that the fading channel follows a stationary complex Gaussian process, as for Rayleigh and Rician fading models, the optimal predictor is linear, and it can be directly computed from the Doppler spectrum via standard linear minimum mean squared error (LMMSE) estimation. However, in practice, the Doppler spectrum is unknown, and the predictor has only access to a limited time series of estimated channels. This paper proposes to leverage meta-learning in order to mitigate the requirements in terms of training data for channel fading prediction. Specifically, it first develops an offline low-complexity solution based on linear filtering via a meta-trained quadratic regularization. Then, an online method is proposed based on gradient descent and equilibrium propagation (EP). Numerical results demonstrate the advantages of the proposed approach, showing its capacity to approach the genie-aided LMMSE solution with a small number of training data points.
    Sequential Recommendation via Stochastic Self-Attention. (arXiv:2201.06035v2 [cs.IR] UPDATED)
    Sequential recommendation models the dynamics of a user's previous behaviors in order to forecast the next item, and has drawn a lot of attention. Transformer-based approaches, which embed items as vectors and use dot-product self-attention to measure the relationship between items, demonstrate superior capabilities among existing sequential methods. However, users' real-world sequential behaviors are \textit{\textbf{uncertain}} rather than deterministic, posing a significant challenge to present techniques. We further suggest that dot-product-based approaches cannot fully capture \textit{\textbf{collaborative transitivity}}, which can be derived in item-item transitions inside sequences and is beneficial for cold start items. We further argue that BPR loss has no constraint on positive and sampled negative items, which misleads the optimization. We propose a novel \textbf{STO}chastic \textbf{S}elf-\textbf{A}ttention~(STOSA) to overcome these issues. STOSA, in particular, embeds each item as a stochastic Gaussian distribution, the covariance of which encodes the uncertainty. We devise a novel Wasserstein Self-Attention module to characterize item-item position-wise relationships in sequences, which effectively incorporates uncertainty into model training. Wasserstein attentions also enlighten the collaborative transitivity learning as it satisfies triangle inequality. Moreover, we introduce a novel regularization term to the ranking loss, which assures the dissimilarity between positive and the negative items. Extensive experiments on five real-world benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art baselines, especially on cold start items. The code is available in \url{https://github.com/zfan20/STOSA}.
    Anatomy-XNet: A Semi-Supervised Anatomy Aware Convolutional Neural Network for Thoracic Disease Classification. (arXiv:2106.05915v2 [eess.IV] UPDATED)
    Thoracic disease detection from chest radiographs using deep learning methods has been an active area of research in the last decade. Most previous methods attempt to focus on the diseased organs of the image by identifying spatial regions responsible for significant contributions to the model's prediction. In contrast, expert radiologists first locate the prominent anatomical structures before determining if those regions are anomalous. Therefore, integrating anatomical knowledge within deep learning models could bring substantial improvement in automatic disease classification. Motivated by this, we propose Anatomy-XNet, an anatomy-aware attention-based thoracic disease classification network that prioritizes the spatial features guided by the pre-identified anatomy regions. We adopt a semi-supervised learning method by utilizing available small-scale organ-level annotation to localize the anatomy regions in large-scale datasets where the organ-level annotations are absent. The proposed Anatomy-XNet uses the pre-trained DenseNet-121 as the backbone network with two corresponding structured modules, the Anatomy Aware Attention (AAA) and Probabilistic Weighted Average Pooling (PWAP), in a cohesive framework for anatomical attention learning. We experimentally show that our proposed method sets a new state-of-the-art benchmark by achieving an AUC score of 85.66%, 91.13%, and, 84.04% on three publicly available large-scale CXR datasets--NIH, Stanford CheXpert, and MIMIC-CXR, respectively. This not only proves the efficacy of utilizing the anatomy segmentation knowledge to improve the thoracic disease classification but also demonstrates the generalizability of the proposed framework.
    Leveraging Reward Gradients For Reinforcement Learning in Differentiable Physics Simulations. (arXiv:2203.02857v1 [cs.LG])
    In recent years, fully differentiable rigid body physics simulators have been developed, which can be used to simulate a wide range of robotic systems. In the context of reinforcement learning for control, these simulators theoretically allow algorithms to be applied directly to analytic gradients of the reward function. However, to date, these gradients have proved extremely challenging to use, and are outclassed by algorithms using no gradient information at all. In this work we present a novel algorithm, cross entropy analytic policy gradients, that is able to leverage these gradients to outperform state of art deep reinforcement learning on a set of challenging nonlinear control problems.
    Unrolling PALM for sparse semi-blind source separation. (arXiv:2112.05694v2 [astro-ph.IM] UPDATED)
    Sparse Blind Source Separation (BSS) has become a well established tool for a wide range of applications - for instance, in astrophysics and remote sensing. Classical sparse BSS methods, such as the Proximal Alternating Linearized Minimization (PALM) algorithm, nevertheless often suffer from a difficult hyperparameter choice, which undermines their results. To bypass this pitfall, we propose in this work to build on the thriving field of algorithm unfolding/unrolling. Unrolling PALM enables to leverage the data-driven knowledge stemming from realistic simulations or ground-truth data by learning both PALM hyperparameters and variables. In contrast to most existing unrolled algorithms, which assume a fixed known dictionary during the training and testing phases, this article further emphasizes on the ability to deal with variable mixing matrices (a.k.a. dictionaries). The proposed Learned PALM (LPALM) algorithm thus enables to perform semi-blind source separation, which is key to increase the generalization of the learnt model in real-world applications. We illustrate the relevance of LPALM in astrophysical multispectral imaging: the algorithm not only needs up to $10^4-10^5$ times fewer iterations than PALM, but also improves the separation quality, while avoiding the cumbersome hyperparameter and initialization choice of PALM. We further show that LPALM outperforms other unrolled source separation methods in the semi-blind setting.
    Federated Learning with Buffered Asynchronous Aggregation. (arXiv:2106.06639v4 [cs.LG] UPDATED)
    Scalability and privacy are two critical concerns for cross-device federated learning (FL) systems. In this work, we identify that synchronous FL - synchronized aggregation of client updates in FL - cannot scale efficiently beyond a few hundred clients training in parallel. It leads to diminishing returns in model performance and training speed, analogous to large-batch training. On the other hand, asynchronous aggregation of client updates in FL (i.e., asynchronous FL) alleviates the scalability issue. However, aggregating individual client updates is incompatible with Secure Aggregation, which could result in an undesirable level of privacy for the system. To address these concerns, we propose a novel buffered asynchronous aggregation method, FedBuff, that is agnostic to the choice of optimizer, and combines the best properties of synchronous and asynchronous FL. We empirically demonstrate that FedBuff is 3.3x more efficient than synchronous FL and up to 2.5x more efficient than asynchronous FL, while being compatible with privacy-preserving technologies such as Secure Aggregation and differential privacy. We provide theoretical convergence guarantees in a smooth non-convex setting. Finally, we show that under differentially private training, FedBuff can outperform FedAvgM at low privacy settings and achieve the same utility for higher privacy settings.
    Learning Stochastic Shortest Path with Linear Function Approximation. (arXiv:2110.12727v2 [cs.LG] UPDATED)
    We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems as linear mixture SSPs. We propose a novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which can attain a $\tilde{\mathcal{O}}(dB_*^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is the number of episodes, $d$ is the dimension of the feature mapping in the mixture model, $B_*$ bounds the expected cumulative cost of the optimal policy, and $c_{\min}>0$ is the lower bound of the cost function. Our algorithm also applies to the case when $c_{\min} = 0$, and a $\tilde{\mathcal{O}}(K^{2/3})$ regret is guaranteed. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSP. Moreover, we design a refined Bernstein-type confidence set and propose an improved algorithm, which provably achieves a $\tilde{\mathcal{O}}(dB_*\sqrt{K/c_{\min}})$ regret. In complement to the regret upper bounds, we also prove a lower bound of $\Omega(dB_* \sqrt{K})$. Hence, our improved algorithm matches the lower bound up to a $1/\sqrt{c_{\min}}$ factor and poly-logarithmic factors, achieving a near-optimal regret guarantee.
    Training privacy-preserving video analytics pipelines by suppressing features that reveal information about private attributes. (arXiv:2203.02635v1 [cs.CV])
    Deep neural networks are increasingly deployed for scene analytics, including to evaluate the attention and reaction of people exposed to out-of-home advertisements. However, the features extracted by a deep neural network that was trained to predict a specific, consensual attribute (e.g. emotion) may also encode and thus reveal information about private, protected attributes (e.g. age or gender). In this work, we focus on such leakage of private information at inference time. We consider an adversary with access to the features extracted by the layers of a deployed neural network and use these features to predict private attributes. To prevent the success of such an attack, we modify the training of the network using a confusion loss that encourages the extraction of features that make it difficult for the adversary to accurately predict private attributes. We validate this training approach on image-based tasks using a publicly available dataset. Results show that, compared to the original network, the proposed PrivateNet can reduce the leakage of private information of a state-of-the-art emotion recognition classifier by 2.88% for gender and by 13.06% for age group, with a minimal effect on task accuracy.
    Audio-visual speech separation based on joint feature representation with cross-modal attention. (arXiv:2203.02655v1 [cs.SD])
    Multi-modal based speech separation has exhibited a specific advantage on isolating the target character in multi-talker noisy environments. Unfortunately, most of current separation strategies prefer a straightforward fusion based on feature learning of each single modality, which is far from sufficient consideration of inter-relationships between modalites. Inspired by learning joint feature representations from audio and visual streams with attention mechanism, in this study, a novel cross-modal fusion strategy is proposed to benefit the whole framework with semantic correlations between different modalities. To further improve audio-visual speech separation, the dense optical flow of lip motion is incorporated to strengthen the robustness of visual representation. The evaluation of the proposed work is performed on two public audio-visual speech separation benchmark datasets. The overall improvement of the performance has demonstrated that the additional motion network effectively enhances the visual representation of the combined lip images and audio signal, as well as outperforming the baseline in terms of all metrics with the proposed cross-modal fusion.
    Neural tensor contractions and the expressive power of deep neural quantum states. (arXiv:2103.10293v3 [quant-ph] UPDATED)
    We establish a direct connection between general tensor networks and deep feed-forward artificial neural networks. The core of our results is the construction of neural-network layers that efficiently perform tensor contractions, and that use commonly adopted non-linear activation functions. The resulting deep networks feature a number of edges that closely matches the contraction complexity of the tensor networks to be approximated. In the context of many-body quantum states, this result establishes that neural-network states have strictly the same or higher expressive power than practically usable variational tensor networks. As an example, we show that all matrix product states can be efficiently written as neural-network states with a number of edges polynomial in the bond dimension and depth logarithmic in the system size. The opposite instead does not hold true, and our results imply that there exist quantum states that are not efficiently expressible in terms of matrix product states or PEPS, but that are instead efficiently expressible with neural network states.
    Frames for Graph Signals on the Symmetric Group: A Representation Theoretic Approach. (arXiv:2203.03036v1 [eess.SP])
    An important problem in the field of graph signal processing is developing appropriate overcomplete dictionaries for signals defined on different families of graphs. The Cayley graph of the symmetric group has natural applications in ranked data analysis, as its vertices represent permutations, while the generating set formalizes a notion of distance between rankings. Taking advantage of the rich theory of representations of the symmetric group, we study a particular class of frames, called Frobenius-Schur frames, where every atom belongs to the coefficient space of only one irreducible representation of the symmetric group. We provide a characterization for all Frobenius-Schur frames on the group algebra of the symmetric group which are "compatible" with respect to the generating set. Such frames have been previously studied for the permutahedron, the Cayley graph of the symmetric group with the generating set of adjacent transpositions, and have proved to be capable of producing meaningful interpretation of the ranked data set via the analysis coefficients. Our results generalize frame constructions for the permutahedron to any inverse-closed generating set.
    Differentially Private Federated Learning with Local Regularization and Sparsification. (arXiv:2203.03106v1 [cs.LG])
    User-level differential privacy (DP) provides certifiable privacy guarantees to the information that is specific to any user's data in federated learning. Existing methods that ensure user-level DP come at the cost of severe accuracy decrease. In this paper, we study the cause of model performance degradation in federated learning under user-level DP guarantee. We find the key to solving this issue is to naturally restrict the norm of local updates before executing operations that guarantee DP. To this end, we propose two techniques, Bounded Local Update Regularization and Local Update Sparsification, to increase model quality without sacrificing privacy. We provide theoretical analysis on the convergence of our framework and give rigorous privacy guarantees. Extensive experiments show that our framework significantly improves the privacy-utility trade-off over the state-of-the-arts for federated learning with user-level DP guarantee.
    Frame Averaging for Invariant and Equivariant Network Design. (arXiv:2110.03336v2 [cs.LG] UPDATED)
    Many machine learning tasks involve learning functions that are known to be invariant or equivariant to certain symmetries of the input data. However, it is often challenging to design neural network architectures that respect these symmetries while being expressive and computationally efficient. For example, Euclidean motion invariant/equivariant graph or point cloud neural networks. We introduce Frame Averaging (FA), a general purpose and systematic framework for adapting known (backbone) architectures to become invariant or equivariant to new symmetry types. Our framework builds on the well known group averaging operator that guarantees invariance or equivariance but is intractable. In contrast, we observe that for many important classes of symmetries, this operator can be replaced with an averaging operator over a small subset of the group elements, called a frame. We show that averaging over a frame guarantees exact invariance or equivariance while often being much simpler to compute than averaging over the entire group. Furthermore, we prove that FA-based models have maximal expressive power in a broad setting and in general preserve the expressive power of their backbone architectures. Using frame averaging, we propose a new class of universal Graph Neural Networks (GNNs), universal Euclidean motion invariant point cloud networks, and Euclidean motion invariant Message Passing (MP) GNNs. We demonstrate the practical effectiveness of FA on several applications including point cloud normal estimation, beyond $2$-WL graph separation, and $n$-body dynamics prediction, achieving state-of-the-art results in all of these benchmarks.
    Disentangling intrinsic motion from neighbourhood effects in heterogeneous collective motion. (arXiv:2110.05864v2 [cs.LG] UPDATED)
    Most real world collectives, including active particles, living cells, and grains, are heterogeneous, where individuals with differing properties interact. The differences among individuals in their intrinsic properties have emergent effects at the group level. It is often of interest to infer how the intrinsic properties differ among the individuals, based on their observed movement patterns. However, the true individual properties may be masked by emergent effects in the collective. We investigate the inference problem in the context of a bidisperse collective with two types of agents, where the goal is to observe the motion of the collective and classify the agents according to their types. Since collective effects such as jamming and clustering affect individual motion, an agent's own movement does not have sufficient information to perform the classification well: a simple observer algorithm, based only on individual velocities cannot accurately estimate the level of heterogeneity of the system, and often misclassifies agents. We propose a novel approach to the classification problem, where collective effects on an agent's motion is explicitly accounted for. We use insights about the physics of collective motion to quantify the effect of the neighbourhood on an agent using a neighbourhood parameter. Such an approach can distinguish between agents of two types, even when their observed motion is identical. This approach estimates the level of heterogeneity much more accurately, and achieves significant improvements in classification. Our results demonstrate that explicitly accounting for neighbourhood effects is often necessary to correctly infer intrinsic properties of individuals.
    Low-cost prediction of molecular and transition state partition functions via machine learning. (arXiv:2203.02621v1 [physics.chem-ph])
    We have generated an open-source dataset of over 30000 organic chemistry gas phase partition functions. With this data, a machine learning deep neural network estimator was trained to predict partition functions of unknown organic chemistry gas phase transition states. This estimator only relies on reactant and product geometries and partition functions. A second machine learning deep neural network was trained to predict partition functions of chemical species from their geometry. Our models accurately predict the logarithm of test set partition functions with a maximum mean absolute error of 2.7%. Thus, this approach provides a means to reduce the cost of computing reaction rate constants ab initio. The models were also used to compute transition state theory reaction rate constants prefactors and the results were in quantitative agreement with the corresponding ab initio calculations with an accuracy of 98.3% on the log scale.
    Prediction of transport property via machine learning molecular movements. (arXiv:2203.03103v1 [physics.chem-ph])
    Molecular dynamics (MD) simulations are increasingly being combined with machine learning (ML) to predict material properties. The molecular configurations obtained from MD are represented by multiple features, such as thermodynamic properties, and are used as the ML input. However, to accurately find the input--output patterns, ML requires a sufficiently sized dataset that depends on the complexity of the ML model. Generating such a large dataset from MD simulations is not ideal because of their high computation cost. In this study, we present a simple supervised ML method to predict the transport properties of materials. To simplify the model, an unsupervised ML method obtains an efficient representation of molecular movements. This method was applied to predict the viscosity of lubricant molecules in confinement with shear flow. Furthermore, simplicity facilitates the interpretation of the model to understand the molecular mechanics of viscosity. We revealed two types of molecular mechanisms that contribute to low viscosity.
    Domain Adaptation with Factorizable Joint Shift. (arXiv:2203.02902v1 [cs.LG])
    Existing domain adaptation (DA) usually assumes the domain shift comes from either the covariates or the labels. However, in real-world applications, samples selected from different domains could have biases in both the covariates and the labels. In this paper, we propose a new assumption, Factorizable Joint Shift (FJS), to handle the co-existence of sampling bias in covariates and labels. Although allowing for the shift from both sides, FJS assumes the independence of the bias between the two factors. We provide theoretical and empirical understandings about when FJS degenerates to prior assumptions and when it is necessary. We further propose Joint Importance Aligning (JIA), a discriminative learning objective to obtain joint importance estimators for both supervised and unsupervised domain adaptation. Our method can be seamlessly incorporated with existing domain adaptation algorithms for better importance estimation and weighting on the training data. Experiments on a synthetic dataset demonstrate the advantage of our method.
    MaxDropoutV2: An Improved Method to Drop out Neurons in Convolutional Neural Networks. (arXiv:2203.02740v1 [cs.LG])
    In the last decade, exponential data growth supplied the machine learning-based algorithms' capacity and enabled their usage in daily life activities. Additionally, such an improvement is partially explained due to the advent of deep learning techniques, i.e., stacks of simple architectures that end up in more complex models. Although both factors produce outstanding results, they also pose drawbacks regarding the learning process since training complex models denotes an expensive task and results are prone to overfit the training data. A supervised regularization technique called MaxDropout was recently proposed to tackle the latter, providing several improvements concerning traditional regularization approaches. In this paper, we present its improved version called MaxDropoutV2. Results considering two public datasets show that the model performs faster than the standard version and, in most cases, provides more accurate results.
    Generating Out of Distribution Adversarial Attack using Latent Space Poisoning. (arXiv:2012.05027v2 [cs.CV] UPDATED)
    Traditional adversarial attacks rely upon the perturbations generated by gradients from the network which are generally safeguarded by gradient guided search to provide an adversarial counterpart to the network. In this paper, we propose a novel mechanism of generating adversarial examples where the actual image is not corrupted rather its latent space representation is utilized to tamper with the inherent structure of the image while maintaining the perceptual quality intact and to act as legitimate data samples. As opposed to gradient-based attacks, the latent space poisoning exploits the inclination of classifiers to model the independent and identical distribution of the training dataset and tricks it by producing out of distribution samples. We train a disentangled variational autoencoder (beta-VAE) to model the data in latent space and then we add noise perturbations using a class-conditioned distribution function to the latent space under the constraint that it is misclassified to the target label. Our empirical results on MNIST, SVHN, and CelebA dataset validate that the generated adversarial examples can easily fool robust l_0, l_2, l_inf norm classifiers designed using provably robust defense mechanisms.
    A Local Method for Identifying Causal Relations under Markov Equivalence. (arXiv:2102.12685v2 [stat.ML] UPDATED)
    Causality is important for designing interpretable and robust methods in artificial intelligence research. We propose a local approach to identify whether a variable is a cause of a given target under the framework of causal graphical models of directed acyclic graphs (DAGs). In general, the causal relation between two variables may not be identifiable from observational data as many causal DAGs encoding different causal relations are Markov equivalent. In this paper, we first introduce a sufficient and necessary graphical condition to check the existence of a causal path from a variable to a target in every Markov equivalent DAG. Next, we provide local criteria for identifying whether a variable is a cause/non-cause of a target based only on the local structure instead of the entire graph. Finally, we propose a local learning algorithm for this causal query via learning the local structure of the variable and some additional statistical independence tests related to the target. Simulation studies show that our local algorithm is efficient and effective, compared with other state-of-art methods.
    Maximum Consensus by Weighted Influences of Monotone Boolean Functions. (arXiv:2112.00953v4 [cs.CV] UPDATED)
    Robust model fitting is a fundamental problem in computer vision: used to pre-process raw data in the presence of outliers. Maximisation of Consensus (MaxCon) is one of the most popular robust criteria and widely used. Recently (Tennakoon et al. CVPR2021), a connection has been made between MaxCon and estimation of influences of a Monotone Boolean function. Equipping the Boolean cube with different measures and adopting different sampling strategies (two sides of the same coin) can have differing effects: which leads to the current study. This paper studies the concept of weighted influences for solving MaxCon. In particular, we study endowing the Boolean cube with the Bernoulli measure and performing biased (as opposed to uniform) sampling. Theoretically, we prove the weighted influences, under this measure, of points belonging to larger structures are smaller than those of points belonging to smaller structures in general. We also consider another "natural" family of sampling/weighting strategies, sampling with uniform measure concentrated on a particular (Hamming) level of the cube. Based on weighted sampling, we modify the algorithm of Tennakoon et al., and test on both synthetic and real datasets. This paper is not promoting a new approach per se, but rather studying the issue of weighted sampling. Accordingly, we are not claiming to have produced a superior algorithm: rather we show some modest gains of Bernoulli sampling, and we illuminate some of the interactions between structure in data and weighted sampling.
    Machine Learning for CUDA+MPI Design Rules. (arXiv:2203.02530v1 [cs.PF])
    We present a new strategy for automatically exploring the design space of key CUDA+MPI programs and providing design rules that discriminate slow from fast implementations. In such programs, the order of operations (e.g., GPU kernels, MPI communication) and assignment of operations to resources (e.g., GPU streams) makes the space of possible designs enormous. Systems experts have the task of redesigning and reoptimizing these programs to effectively utilize each new platform. This work provides a prototype tool to reduce that burden. In our approach, a directed acyclic graph of CUDA and MPI operations defines the design space for the program. Monte-Carlo tree search discovers regions of the design space that have large impact on the program's performance. A sequence-to-vector transformation defines features for each explored implementation, and each implementation is assigned a class label according to its relative performance. A decision tree is trained on the features and labels to produce design rules for each class; these rules can be used by systems experts to guide their implementations. We demonstrate our strategy using a key kernel from scientific computing -- sparse-matrix vector multiplication -- on a platform with multiple MPI ranks and GPU streams.
    A streamable large-scale clinical EEG dataset for Deep Learning. (arXiv:2203.02552v1 [cs.LG])
    Deep Learning has revolutionized various fields, including Computer Vision, Natural Language Processing, as well as Biomedical research. Within the field of neuroscience, specifically in electrophysiological neuroimaging, researchers are starting to explore leveraging deep learning to make predictions on their data without extensive feature engineering. The availability of large-scale datasets is a crucial aspect of allowing the experimentation of Deep Learning models. We are publishing the first large-scale clinical EEG dataset that simplifies data access and management for Deep Learning. This dataset contains eyes-closed EEG data prepared from a collection of 1,574 juvenile participants from the Healthy Brain Network. We demonstrate a use case integrating this framework, and discuss why providing such neuroinformatics infrastructure to the community is critical for future scientific discoveries.
    Recursive Monte Carlo and Variational Inference with Auxiliary Variables. (arXiv:2203.02836v1 [cs.LG])
    A key challenge in applying Monte Carlo and variational inference (VI) is the design of proposals and variational families that are flexible enough to closely approximate the posterior, but simple enough to admit tractable densities and variational bounds. This paper presents recursive auxiliary-variable inference (RAVI), a new framework for exploiting flexible proposals, for example based on involved simulations or stochastic optimization, within Monte Carlo and VI algorithms. The key idea is to estimate intractable proposal densities via meta-inference: additional Monte Carlo or variational inference targeting the proposal, rather than the model. RAVI generalizes and unifies several existing methods for inference with expressive approximating families, which we show correspond to specific choices of meta-inference algorithm, and provides new theory for analyzing their bias and variance. We illustrate RAVI's design framework and theorems by using them to analyze and improve upon Salimans et al. (2015)'s Markov Chain Variational Inference, and to design a novel sampler for Dirichlet process mixtures, achieving state-of-the-art results on a standard benchmark dataset from astronomy and on a challenging data-cleaning task with Medicare hospital data.
    Scaling R-GCN Training with Graph Summarization. (arXiv:2203.02622v1 [cs.LG])
    Training of Relation Graph Convolutional Networks (R-GCN) does not scale well with the size of the graph. The amount of gradient information that needs to be stored during training for real-world graphs is often too large for the amount of memory available on most GPUs. In this work, we experiment with the use of graph summarization techniques to compress the graph and hence reduce the amount of memory needed. After training the R-GCN on the graph summary, we transfer the weights back to the original graph and attempt to perform inference on it. We obtain reasonable results on the AIFB, MUTAG and AM datasets. This supports the importance and relevancy of graph summarization methods, whose smaller graph representations scale and reduce the computational overhead involved with novel machine learning models dealing with large Knowledge Graphs. However, further experiments are needed to evaluate whether this also holds true for very large graphs.
    Off-Policy Evaluation in Embedded Spaces. (arXiv:2203.02807v1 [cs.LG])
    Off-policy evaluation methods are important in recommendation systems and search engines, whereby data collected under an old logging policy is used to predict the performance of a new target policy. However, in practice most systems are not observed to recommend most of the possible actions, which is an issue since existing methods require that the probability of the target policy recommending an item can only be non-zero when the probability of the logging policy is non-zero (known as absolute continuity). To circumvent this issue, we explore the use of action embeddings. By representing contexts and actions in an embedding space, we are able to share information to extrapolate behaviors for actions and contexts previously unseen.
    Fast Yet Effective Machine Unlearning. (arXiv:2111.08947v3 [cs.LG] UPDATED)
    Unlearning the data observed during the training of a machine learning (ML) model is an important task that can play a pivotal role in fortifying the privacy and security of ML-based applications. This paper raises the following questions: (i) can we unlearn a single or multiple classes of data from an ML model without looking at the full training data even once? (ii) can we make the process of unlearning fast and scalable to large datasets, and generalize it to different deep networks? We introduce a novel machine unlearning framework with error-maximizing noise generation and impair-repair based weight manipulation that offers an efficient solution to the above questions. An error-maximizing noise matrix is learned for the class to be unlearned using the original model. The noise matrix is used to manipulate the model weights to unlearn the targeted class of data. We introduce impair and repair steps for a controlled manipulation of the network weights. In the impair step, the noise matrix along with a very high learning rate is used to induce sharp unlearning in the model. Thereafter, the repair step is used to regain the overall performance. With very few update steps, we show excellent unlearning while substantially retaining the overall model accuracy. Unlearning multiple classes requires a similar number of update steps as for the single class, making our approach scalable to large problems. Our method is quite efficient in comparison to the existing methods, works for multi-class unlearning, doesn't put any constraints on the original optimization mechanism or network design, and works well in both small and large-scale vision tasks. This work is an important step towards fast and easy implementation of unlearning in deep networks. We will make the source code publicly available.
    Offline Deep Reinforcement Learning for Dynamic Pricing of Consumer Credit. (arXiv:2203.03003v1 [cs.LG])
    We introduce a method for pricing consumer credit using recent advances in offline deep reinforcement learning. This approach relies on a static dataset and requires no assumptions on the functional form of demand. Using both real and synthetic data on consumer credit applications, we demonstrate that our approach using the conservative Q-Learning algorithm is capable of learning an effective personalized pricing policy without any online interaction or price experimentation.
    Sparsifying Transformer Models with Trainable Representation Pooling. (arXiv:2009.05169v4 [cs.CL] UPDATED)
    We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process, thus focusing on the task-specific parts of an input. A reduction of quadratic time and memory complexity to sublinear was achieved due to a robust trainable top-$k$ operator. Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and with trainable pooling, we can retain its top quality, while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $13\times$ more computationally efficient in the decoder.
    Deep Reinforcement Learning for Solving the Heterogeneous Capacitated Vehicle Routing Problem. (arXiv:2110.02629v2 [cs.LG] UPDATED)
    Existing deep reinforcement learning (DRL) based methods for solving the capacitated vehicle routing problem (CVRP) intrinsically cope with homogeneous vehicle fleet, in which the fleet is assumed as repetitions of a single vehicle. Hence, their key to construct a solution solely lies in the selection of the next node (customer) to visit excluding the selection of vehicle. However, vehicles in real-world scenarios are likely to be heterogeneous with different characteristics that affect their capacity (or travel speed), rendering existing DRL methods less effective. In this paper, we tackle heterogeneous CVRP (HCVRP), where vehicles are mainly characterized by different capacities. We consider both min-max and min-sum objectives for HCVRP, which aim to minimize the longest or total travel time of the vehicle(s) in the fleet. To solve those problems, we propose a DRL method based on the attention mechanism with a vehicle selection decoder accounting for the heterogeneous fleet constraint and a node selection decoder accounting for the route construction, which learns to construct a solution by automatically selecting both a vehicle and a node for this vehicle at each step. Experimental results based on randomly generated instances show that, with desirable generalization to various problem sizes, our method outperforms the state-of-the-art DRL method and most of the conventional heuristics, and also delivers competitive performance against the state-of-the-art heuristic method, i.e., SISR. Additionally, the results of extended experiments demonstrate that our method is also able to solve CVRPLib instances with satisfactory performance.
    Towards Efficient and Scalable Sharpness-Aware Minimization. (arXiv:2203.02714v1 [cs.LG])
    Recently, Sharpness-Aware Minimization (SAM), which connects the geometry of the loss landscape and generalization, has demonstrated significant performance boosts on training large-scale models such as vision transformers. However, the update rule of SAM requires two sequential (non-parallelizable) gradient computations at each step, which can double the computational overhead. In this paper, we propose a novel algorithm LookSAM - that only periodically calculates the inner gradient ascent, to significantly reduce the additional training cost of SAM. The empirical results illustrate that LookSAM achieves similar accuracy gains to SAM while being tremendously faster - it enjoys comparable computational complexity with first-order optimizers such as SGD or Adam. To further evaluate the performance and scalability of LookSAM, we incorporate a layer-wise modification and perform experiments in the large-batch training scenario, which is more prone to converge to sharp local minima. We are the first to successfully scale up the batch size when training Vision Transformers (ViTs). With a 64k batch size, we are able to train ViTs from scratch in minutes while maintaining competitive performance.
    McXai: Local model-agnostic explanation as two games. (arXiv:2201.01044v2 [cs.LG] UPDATED)
    To this day, a variety of approaches for providing local interpretability of black-box machine learning models have been introduced. Unfortunately, all of these methods suffer from one or more of the following deficiencies: They are either difficult to understand themselves, they work on a per-feature basis and ignore the dependencies between features and/or they only focus on those features asserting the decision made by the model. To address these points, this work introduces a reinforcement learning-based approach called Monte Carlo tree search for eXplainable Artificial Intelligent (McXai) to explain the decisions of any black-box classification model (classifier). Our method leverages Monte Carlo tree search and models the process of generating explanations as two games. In one game, the reward is maximized by finding feature sets that support the decision of the classifier, while in the second game, finding feature sets leading to alternative decisions maximizes the reward. The result is a human friendly representation as a tree structure, in which each node represents a set of features to be studied with smaller explanations at the top of the tree. Our experiments show, that the features found by our method are more informative with respect to classifications than those found by classical approaches like LIME and SHAP. Furthermore, by also identifying misleading features, our approach is able to guide towards improved robustness of the black-box model in many situations.
    Learning to Minimize the Remainder in Supervised Learning. (arXiv:2201.09193v2 [cs.CV] UPDATED)
    The learning process of deep learning methods usually updates the model's parameters in multiple iterations. Each iteration can be viewed as the first-order approximation of Taylor's series expansion. The remainder, which consists of higher-order terms, is usually ignored in the learning process for simplicity. This learning scheme empowers various multimedia based applications, such as image retrieval, recommendation system, and video search. Generally, multimedia data (e.g., images) are semantics-rich and high-dimensional, hence the remainders of approximations are possibly non-zero. In this work, we consider the remainder to be informative and study how it affects the learning process. To this end, we propose a new learning approach, namely gradient adjustment learning (GAL), to leverage the knowledge learned from the past training iterations to adjust vanilla gradients, such that the remainders are minimized and the approximations are improved. The proposed GAL is model- and optimizer-agnostic, and is easy to adapt to the standard learning framework. It is evaluated on three tasks, i.e., image classification, object detection, and regression, with state-of-the-art models and optimizers. The experiments show that the proposed GAL consistently enhances the evaluated models, whereas the ablation studies validate various aspects of the proposed GAL. The code is available at \url{https://github.com/luoyan407/gradient_adjustment.git}.
    Object-centric Process Predictive Analytics. (arXiv:2203.02801v1 [cs.LG])
    Object-centric processes (a.k.a. Artifact-centric processes) are implementations of a paradigm where an instance of one process is not executed in isolation but interacts with other instances of the same or other processes. Interactions take place through bridging events where instances exchange data. Object-centric processes are recently gaining popularity in academia and industry, because their nature is observed in many application scenarios. This poses significant challenges in predictive analytics due to the complex intricacy of the process instances that relate to each other via many-to-many associations. Existing research is unable to directly exploit the benefits of these interactions, thus limiting the prediction quality. This paper proposes an approach to incorporate the information about the object interactions into the predictive models. The approach is assessed on real-life object-centric process event data, using different KPIs. The results are compared with a naive approach that overlooks the object interactions, thus illustrating the benefits of their use on the prediction quality.
    Momentum Contrastive Voxel-wise Representation Learning for Semi-supervised Volumetric Medical Image Segmentation. (arXiv:2105.07059v4 [cs.CV] UPDATED)
    Contrastive learning (CL) aims to learn useful representation without relying on expert annotations in the context of medical image segmentation. Existing approaches mainly contrast a single positive vector (i.e., an augmentation of the same image) against a set of negatives within the entire remainder of the batch by simply mapping all input features into the same constant vector. Despite the impressive empirical performance, those methods have the following shortcomings: (1) it remains a formidable challenge to prevent the collapsing problems to trivial solutions; and (2) we argue that not all voxels within the same image are equally positive since there exist the dissimilar anatomical structures with the same image. In this work, we present a novel Contrastive Voxel-wise Representation Learning (CVRL) method to effectively learn low-level and high-level features by capturing 3D spatial context and rich anatomical information along both the feature and the batch dimensions. Specifically, we first introduce a novel CL strategy to ensure feature diversity promotion among the 3D representation dimensions. We train the framework through bi-level contrastive optimization (i.e., low-level and high-level) on 3D images. Experiments on two benchmark datasets and different labeled settings demonstrate the superiority of our proposed framework. More importantly, we also prove that our method inherits the benefit of hardness-aware property from the standard CL approaches.
    Sequential change-point detection for mutually exciting point processes over networks. (arXiv:2102.05724v2 [stat.ML] UPDATED)
    We present a new CUSUM procedure for sequentially detecting change-point in the self and mutual exciting processes, a.k.a. Hawkes networks using discrete events data. Hawkes networks have become a popular model for statistics and machine learning due to their capability in modeling irregularly observed data where the timing between events carries a lot of information. The problem of detecting abrupt changes in Hawkes networks arises from various applications, including neuronal imaging, sensor network, and social network monitoring. Despite this, there has not been a computationally and memory-efficient online algorithm for detecting such changes from sequential data. We present an efficient online recursive implementation of the CUSUM statistic for Hawkes processes, both decentralized and memory-efficient, and establish the theoretical properties of this new CUSUM procedure. We then show that the proposed CUSUM method achieves better performance than existing methods, including the Shewhart procedure based on count data, the generalized likelihood ratio (GLR) in the existing literature, and the standard score statistic. We demonstrate this via a simulated example and an application to population code change-detection in neuronal networks.
    MIRROR: Differentiable Deep Social Projection for Assistive Human-Robot Communication. (arXiv:2203.02877v1 [cs.AI])
    Communication is a hallmark of intelligence. In this work, we present MIRROR, an approach to (i) quickly learn human models from human demonstrations, and (ii) use the models for subsequent communication planning in assistive shared-control settings. MIRROR is inspired by social projection theory, which hypothesizes that humans use self-models to understand others. Likewise, MIRROR leverages self-models learned using reinforcement learning to bootstrap human modeling. Experiments with simulated humans show that this approach leads to rapid learning and more robust models compared to existing behavioral cloning and state-of-the-art imitation learning methods. We also present a human-subject study using the CARLA simulator which shows that (i) MIRROR is able to scale to complex domains with high-dimensional observations and complicated world physics and (ii) provides effective assistive communication that enabled participants to drive more safely in adverse weather conditions.
    Random Effect Bandits. (arXiv:2106.12200v2 [cs.LG] UPDATED)
    This paper studies regret minimization in a multi-armed bandit. It is well known that side information, such as the prior distribution of arm means in Thompson sampling, can improve the statistical efficiency of the bandit algorithm. While the prior is a blessing when correctly specified, it is a curse when misspecified. To address this issue, we introduce the assumption of a random-effect model to bandits. In this model, the mean arm rewards are drawn independently from an unknown distribution, which we estimate. We derive a random-effect estimator of the arm means, analyze its uncertainty, and design a UCB algorithm ReUCB that uses it. We analyze ReUCB and derive an upper bound on its $n$-round Bayes regret, which improves upon not using the random-effect structure. Our experiments show that ReUCB can outperform Thompson sampling, without knowing the prior distribution of arm means.
    Singular Value Perturbation and Deep Network Optimization. (arXiv:2203.03099v1 [cs.LG])
    We develop new theoretical results on matrix perturbation to shed light on the impact of architecture on the performance of a deep network. In particular, we explain analytically what deep learning practitioners have long observed empirically: the parameters of some deep architectures (e.g., residual networks, ResNets, and Dense networks, DenseNets) are easier to optimize than others (e.g., convolutional networks, ConvNets). Building on our earlier work connecting deep networks with continuous piecewise-affine splines, we develop an exact local linear representation of a deep network layer for a family of modern deep networks that includes ConvNets at one end of a spectrum and ResNets and DenseNets at the other. For regression tasks that optimize the squared-error loss, we show that the optimization loss surface of a modern deep network is piecewise quadratic in the parameters, with local shape governed by the singular values of a matrix that is a function of the local linear representation. We develop new perturbation results for how the singular values of matrices of this sort behave as we add a fraction of the identity and multiply by certain diagonal matrices. A direct application of our perturbation results explains analytically why a ResNet is easier to optimize than a ConvNet: thanks to its more stable singular values and smaller condition number, the local loss surface of a ResNet or DenseNet is less erratic, less eccentric, and features local minima that are more accommodating to gradient-based optimization. Our results also shed new light on the impact of different nonlinear activation functions on a deep network's singular values, regardless of its architecture.
    Plant Species Recognition with Optimized 3D Polynomial Neural Networks and Variably Overlapping Time-Coherent Sliding Window. (arXiv:2203.02611v1 [cs.CV])
    Recently, the EAGL-I system was developed to rapidly create massive labeled datasets of plants intended to be commonly used by farmers and researchers to create AI-driven solutions in agriculture. As a result, a publicly available plant species recognition dataset composed of 40,000 images with different sizes consisting of 8 plant species was created with the system in order to demonstrate its capabilities. This paper proposes a novel method, called Variably Overlapping Time-Coherent Sliding Window (VOTCSW), that transforms a dataset composed of images with variable size to a 3D representation with fixed size that is suitable for convolutional neural networks, and demonstrates that this representation is more informative than resizing the images of the dataset to a given size. We theoretically formalized the use cases of the method as well as its inherent properties and we proved that it has an oversampling and a regularization effect on the data. By combining the VOTCSW method with the 3D extension of a recently proposed machine learning model called 1-Dimensional Polynomial Neural Networks, we were able to create a model that achieved a state-of-the-art accuracy of 99.9% on the dataset created by the EAGL-I system, surpassing well-known architectures such as ResNet and Inception. In addition, we created a heuristic algorithm that enables the degree reduction of any pre-trained N-Dimensional Polynomial Neural Network and which compresses it without altering its performance, thus making the model faster and lighter. Furthermore, we established that the currently available dataset could not be used for machine learning in its present form, due to a substantial class imbalance between the training set and the test set. Hence, we created a specific preprocessing and a model development framework that enabled us to improve the accuracy from 49.23% to 99.9%.
    Fast and Data Efficient Reinforcement Learning from Pixels via Non-Parametric Value Approximation. (arXiv:2203.03078v1 [cs.LG])
    We present Nonparametric Approximation of Inter-Trace returns (NAIT), a Reinforcement Learning algorithm for discrete action, pixel-based environments that is both highly sample and computation efficient. NAIT is a lazy-learning approach with an update that is equivalent to episodic Monte-Carlo on episode completion, but that allows the stable incorporation of rewards while an episode is ongoing. We make use of a fixed domain-agnostic representation, simple distance based exploration and a proximity graph-based lookup to facilitate extremely fast execution. We empirically evaluate NAIT on both the 26 and 57 game variants of ATARI100k where, despite its simplicity, it achieves competitive performance in the online setting with greater than 100x speedup in wall-time.
    Probably approximately correct quantum source coding. (arXiv:2112.06841v1 [quant-ph] CROSS LISTED)
    Information-theoretic lower bounds are often encountered in several branches of computer science, including learning theory and cryptography. In the quantum setting, Holevo's and Nayak's bounds give an estimate of the amount of classical information that can be stored in a quantum state. Previous works have shown how to combine information-theoretic tools with a counting argument to lower bound the sample complexity of distribution-free quantum probably approximately correct (PAC) learning. In our work, we establish the notion of Probably Approximately Correct Source Coding and we show two novel applications in quantum learning theory and delegated quantum computation with a purely classical client. In particular, we provide a lower bound of the sample complexity of a quantum learner for arbitrary functions under the Zipf distribution, and we improve the security guarantees of a classically-driven delegation protocol for measurement-based quantum computation (MBQC).
    On the Last Iterate Convergence of Momentum Methods. (arXiv:2102.07002v2 [cs.LG] UPDATED)
    SGD with Momentum (SGDM) is a widely used family of algorithms for large-scale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $\Omega(\frac{\ln T}{\sqrt{T}})$ after $T$ iterations. Based on this fact, we study a class of (both adaptive and non-adaptive) Follow-The-Regularized-Leader-based SGDM algorithms with \emph{increasing momentum} and \emph{shrinking updates}. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex stochastic optimization problems without projections onto bounded domains nor knowledge of $T$. Further, we show a variety of results for FTRL-based SGDM when used with adaptive stepsizes. Empirical results are shown as well.
    Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning. (arXiv:2203.02573v1 [cs.CV])
    Most methods for conditional video synthesis use a single modality as the condition. This comes with major limitations. For example, it is problematic for a model conditioned on an image to generate a specific motion trajectory desired by the user since there is no means to provide motion information. Conversely, language information can describe the desired motion, while not precisely defining the content of the video. This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately. We leverage the recent progress in quantized representations for videos and apply a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. To improve video quality and consistency, we propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens. We introduce text augmentation to improve the robustness of the textual representation and diversity of generated videos. Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images. It can generate much longer sequences than the one used for training. In addition, our model can extract visual information as suggested by the text prompt, e.g., "an object in image one is moving northeast", and generate corresponding videos. We run evaluations on three public datasets and a newly collected dataset labeled with facial attributes, achieving state-of-the-art generation results on all four.
    Fast Community Detection based on Graph Autoencoder Reconstruction. (arXiv:2203.03151v1 [cs.SI])
    With the rapid development of big data, how to efficiently and accurately discover tight community structures in large-scale networks for knowledge discovery has attracted more and more attention. In this paper, a community detection framework based on Graph AutoEncoder Reconstruction (noted as GAER) is proposed for the first time. GAER is a highly scalable framework which does not require any prior information. We decompose the graph autoencoder-based one-step encoding into the two-stage encoding framework to adapt to the real-world big data system by reducing complexity from the original O(N^2) to O(N). At the same time, based on the advantages of GAER support module plug-and-play configuration and incremental community detection, we further propose a peer awareness based module for real-time large graphs, which can realize the new nodes community detection at a faster speed, and accelerate model inference with the 6.15 times - 14.03 times speed. Finally, we apply the GAER on multiple real-world datasets, including some large-scale networks. The experimental result verified that GAER has achieved the superior performance on almost all networks.
    A New Bandit Setting Balancing Information from State Evolution and Corrupted Context. (arXiv:2011.07989v3 [cs.LG] UPDATED)
    We propose a new sequential decision-making setting, motivated by mobile health applications, based on combining key aspects of two established online problems with bandit feedback. Both assume that the optimal action to play is contingent on an underlying changing state which is not directly observable by the agent. They differ in what kind of side information can be used to estimate this state. The first one considers each state to be associated with a context, possibly corrupted, allowing the agent to learn the context-to-state mapping. The second one considers that the state itself evolves in a Markovian fashion, thus allowing the agent to estimate the current state based on history. We argue that it is realistic for the agent to have access to both these sources of information, i.e., an arbitrarily corrupted context obeying the Markov property. Thus, the agent is faced with a new challenge of balancing its belief about the reliability of information from learned state transitions versus context information. We present an algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit. We capture the time-correlation of states through iteratively learning the action-reward transition model, allowing for efficient exploration of actions. Users transition through different unobserved, time-correlated but only partially observable internal states, which determine their current needs. The side-information about users might not always be reliable and standard approaches solely relying on the context risk incurring high regret. Similarly, some users might exhibit weaker correlations between subsequent states, leading to approaches that solely rely on state transitions risking the same. We evaluate our method on simulated data and on several real-world data sets, showing improved empirical performance compared to several popular algorithms.
    Improved rates for prediction and identification of partially observed linear dynamical systems. (arXiv:2011.10006v3 [cs.LG] UPDATED)
    Identification of a linear time-invariant dynamical system from partial observations is a fundamental problem in control theory. Particularly challenging are systems exhibiting long-term memory. A natural question is how learn such systems with non-asymptotic statistical rates depending on the inherent dimensionality (order) $d$ of the system, rather than on the possibly much larger memory length. We propose an algorithm that given a single trajectory of length $T$ with gaussian observation noise, learns the system with a near-optimal rate of $\widetilde O\left(\sqrt\frac{d}{T}\right)$ in $\mathcal{H}_2$ error, with only logarithmic, rather than polynomial dependence on memory length. We also give bounds under process noise and improved bounds for learning a realization of the system. Our algorithm is based on multi-scale low-rank approximation: SVD applied to Hankel matrices of geometrically increasing sizes. Our analysis relies on careful application of concentration bounds on the Fourier domain -- we give sharper concentration bounds for sample covariance of correlated inputs and for $\mathcal H_\infty$ norm estimation, which may be of independent interest.
    Scale Mixtures of Neural Network Gaussian Processes. (arXiv:2107.01408v2 [stat.ML] UPDATED)
    Recent works have revealed that infinitely-wide feed-forward or recurrent neural networks of any architecture correspond to Gaussian processes referred to as Neural Network Gaussian Processes (NNGPs). While these works have extended the class of neural networks converging to Gaussian processes significantly, however, there has been little focus on broadening the class of stochastic processes that such neural networks converge to. In this work, inspired by the scale mixture of Gaussian random variables, we propose the scale mixture of NNGPs for which we introduce a prior distribution on the scale of the last-layer parameters. We show that simply introducing a scale prior on the last-layer parameters can turn infinitely-wide neural networks of any architecture into a richer class of stochastic processes. With certain scale priors, we obtain heavy-tailed stochastic processes, and in the case of inverse gamma priors, we recover Student's $t$ processes. We further analyze the distributions of the neural networks initialized with our prior setting and trained with gradient descents and obtain similar results as for NNGPs. We present a practical posterior-inference algorithm for the scale mixture of NNGPs and empirically demonstrate its usefulness on regression and classification tasks. In particular, we show that in both tasks, the heavy-tailed stochastic processes obtained from our framework are robust to out-of-distribution data.
    Thompson Sampling with a Mixture Prior. (arXiv:2106.05608v2 [cs.LG] UPDATED)
    We study Thompson sampling (TS) in online decision making, where the uncertain environment is sampled from a mixture distribution. This is relevant in multi-task learning, where a learning agent faces different classes of problems. We incorporate this structure in a natural way by initializing TS with a mixture prior, and call the resulting algorithm MixTS. To analyze MixTS, we develop a novel and general proof technique for analyzing the concentration of mixture distributions. We use it to prove Bayes regret bounds for MixTS in both linear bandits and finite-horizon reinforcement learning. Our bounds capture the structure of the prior, depend on the number of mixture components and their widths. We also demonstrate the empirical effectiveness of MixTS in synthetic and real-world experiments.
    ARM 4-BIT PQ: SIMD-based Acceleration for Approximate Nearest Neighbor Search on ARM. (arXiv:2203.02505v1 [cs.LG])
    We accelerate the 4-bit product quantization (PQ) on the ARM architecture. Notably, the drastic performance of the conventional 4-bit PQ strongly relies on x64-specific SIMD register, such as AVX2; hence, we cannot yet achieve such good performance on ARM. To fill this gap, we first bundle two 128-bit registers as one 256-bit component. We then apply shuffle operations for each using the ARM-specific NEON instruction. By making this simple but critical modification, we achieve a dramatic speedup for the 4-bit PQ on an ARM architecture. Experiments show that the proposed method consistently achieves a 10x improvement over the naive PQ with the same accuracy.
    Bayesian Learning Approach to Model Predictive Control. (arXiv:2203.02720v1 [cs.LG])
    This study presents a Bayesian learning perspective towards model predictive control algorithms. High-level frameworks have been developed separately in the earlier studies on Bayesian learning and sampling-based model predictive control. On one hand, the Bayesian learning rule provides a general framework capable of generating various machine learning algorithms as special instances. On the other hand, the dynamic mirror descent model predictive control framework is capable of diversifying sample-rollout-based control algorithms. However, connections between the two frameworks have still not been fully appreciated in the context of stochastic optimal control. This study combines the Bayesian learning rule point of view into the model predictive control setting by taking inspirations from the view of understanding model predictive controller as an online learner. The selection of posterior class and natural gradient approximation for the variational formulation governs diversification of model predictive control algorithms in the Bayesian learning approach to model predictive control. This alternative viewpoint complements the dynamic mirror descent framework through streamlining the explanation of design choices.
    Compartmental Models for COVID-19 and Control via Policy Interventions. (arXiv:2203.02860v1 [stat.ML])
    We demonstrate an approach to replicate and forecast the spread of the SARS-CoV-2 (COVID-19) pandemic using the toolkit of probabilistic programming languages (PPLs). Our goal is to study the impact of various modeling assumptions and motivate policy interventions enacted to limit the spread of infectious diseases. Using existing compartmental models we show how to use inference in PPLs to obtain posterior estimates for disease parameters. We improve popular existing models to reflect practical considerations such as the under-reporting of the true number of COVID-19 cases and motivate the need to model policy interventions for real-world data. We design an SEI3RD model as a reusable template and demonstrate its flexibility in comparison to other models. We also provide a greedy algorithm that selects the optimal series of policy interventions that are likely to control the infected population subject to provided constraints. We work within a simple, modular, and reproducible framework to enable immediate cross-domain access to the state-of-the-art in probabilistic inference with emphasis on policy interventions. We are not epidemiologists; the sole aim of this study is to serve as an exposition of methods, not to directly infer the real-world impact of policy-making for COVID-19.
    Reinforcement Learning in Modern Biostatistics: Constructing Optimal Adaptive Interventions. (arXiv:2203.02605v1 [stat.ML])
    Reinforcement learning (RL) is acquiring a key role in the space of adaptive interventions (AIs), attracting a substantial interest within methodological and theoretical literature and becoming increasingly popular within health sciences. Despite potential benefits, its application in real life is still limited due to several operational and statistical challenges--in addition to ethical and cost issues among others--that remain open in part due to poor communication and synergy between methodological and applied scientists. In this work, we aim to bridge the different domains that contribute to and may benefit from RL, under a unique framework that intersects the areas of RL, causal inference, and AIs, among others. We provide the first unified instructive survey on RL methods for building AIs, encompassing both dynamic treatment regimes (DTRs) and just-in-time adaptive interventions in mobile health (mHealth). We outline similarities and differences between the two areas, and discuss their implications for using RL. We combine our relevant methodological knowledge with motivating studies in both DTRs and mHealth to illustrate the tremendous collaboration opportunities between statistical, RL, and healthcare researchers in the space of AIs.
    Koopman operator for time-dependent reliability analysis. (arXiv:2203.02658v1 [stat.ML])
    Time-dependent structural reliability analysis of nonlinear dynamical systems is non-trivial; subsequently, scope of most of the structural reliability analysis methods is limited to time-independent reliability analysis only. In this work, we propose a Koopman operator based approach for time-dependent reliability analysis of nonlinear dynamical systems. Since the Koopman representations can transform any nonlinear dynamical system into a linear dynamical system, the time evolution of dynamical systems can be obtained by Koopman operators seamlessly regardless of the nonlinear or chaotic behavior. Despite the fact that the Koopman theory has been in vogue a long time back, identifying intrinsic coordinates is a challenging task; to address this, we propose an end-to-end deep learning architecture that learns the Koopman observables and then use it for time marching the dynamical response. Unlike purely data-driven approaches, the proposed approach is robust even in the presence of uncertainties; this renders the proposed approach suitable for time-dependent reliability analysis. We propose two architectures; one suitable for time-dependent reliability analysis when the system is subjected to random initial condition and the other suitable when the underlying system have uncertainties in system parameters. The proposed approach is robust and generalizes to unseen environment (out-of-distribution prediction). Efficacy of the proposed approached is illustrated using three numerical examples. Results obtained indicate supremacy of the proposed approach as compared to purely data-driven auto-regressive neural network and long-short term memory network.
    Recursive Reasoning Graph for Multi-Agent Reinforcement Learning. (arXiv:2203.02844v1 [cs.LG])
    Multi-agent reinforcement learning (MARL) provides an efficient way for simultaneously learning policies for multiple agents interacting with each other. However, in scenarios requiring complex interactions, existing algorithms can suffer from an inability to accurately anticipate the influence of self-actions on other agents. Incorporating an ability to reason about other agents' potential responses can allow an agent to formulate more effective strategies. This paper adopts a recursive reasoning model in a centralized-training-decentralized-execution framework to help learning agents better cooperate with or compete against others. The proposed algorithm, referred to as the Recursive Reasoning Graph (R2G), shows state-of-the-art performance on multiple multi-agent particle and robotics games.
    Hybrid Deep Learning Model using SPCAGAN Augmentation for Insider Threat Analysis. (arXiv:2203.02855v1 [cs.CR])
    Cyberattacks from within an organization's trusted entities are known as insider threats. Anomaly detection using deep learning requires comprehensive data, but insider threat data is not readily available due to confidentiality concerns of organizations. Therefore, there arises demand to generate synthetic data to explore enhanced approaches for threat analysis. We propose a linear manifold learning-based generative adversarial network, SPCAGAN, that takes input from heterogeneous data sources and adds a novel loss function to train the generator to produce high-quality data that closely resembles the original data distribution. Furthermore, we introduce a deep learning-based hybrid model for insider threat analysis. We provide extensive experiments for data synthesis, anomaly detection, adversarial robustness, and synthetic data quality analysis using benchmark datasets. In this context, empirical comparisons show that GAN-based oversampling is competitive with numerous typical oversampling regimes. For synthetic data generation, our SPCAGAN model overcame the problem of mode collapse and converged faster than previous GAN models. Results demonstrate that our proposed approach has a lower error, is more accurate, and generates substantially superior synthetic insider threat data than previous models.
    Fine-Tuning Pretrained Language Models With Label Attention for Biomedical Text Classification. (arXiv:2108.11809v3 [cs.CL] UPDATED)
    The massive scale and growth of textual biomedical data have made its indexing and classification increasingly important. However, existing research on this topic mainly utilized convolutional and recurrent neural networks, which generally achieve inferior performance than the novel transformers. On the other hand, systems that apply transformers only focus on the target documents, overlooking the rich semantic information that label descriptions contain. To address this gap, we develop a transformer-based biomedical text classifier that considers label information. The system achieves this with a label attention module incorporated into the fine-tuning process of pretrained language models (PTMs). Our results on two public medical datasets show that the proposed fine-tuning scheme outperforms the vanilla PTMs and state-of-the-art models.
    Satellite Image-based Localization via Learned Embeddings. (arXiv:1704.01133v2 [cs.RO] UPDATED)
    We propose a vision-based method that localizes a ground vehicle using publicly available satellite imagery as the only prior knowledge of the environment. Our approach takes as input a sequence of ground-level images acquired by the vehicle as it navigates, and outputs an estimate of the vehicle's pose relative to a georeferenced satellite image. We overcome the significant viewpoint and appearance variations between the images through a neural multi-view model that learns location-discriminative embeddings in which ground-level images are matched with their corresponding satellite view of the scene. We use this learned function as an observation model in a filtering framework to maintain a distribution over the vehicle's pose. We evaluate our method on different benchmark datasets and demonstrate its ability localize ground-level images in environments novel relative to training, despite the challenges of significant viewpoint and appearance variations.
    Unfreeze with Care: Space-Efficient Fine-Tuning of Semantic Parsing Models. (arXiv:2203.02652v1 [cs.CL])
    Semantic parsing is a key NLP task that maps natural language to structured meaning representations. As in many other NLP tasks, SOTA performance in semantic parsing is now attained by fine-tuning a large pretrained language model (PLM). While effective, this approach is inefficient in the presence of multiple downstream tasks, as a new set of values for all parameters of the PLM needs to be stored for each task separately. Recent work has explored methods for adapting PLMs to downstream tasks while keeping most (or all) of their parameters frozen. We examine two such promising techniques, prefix tuning and bias-term tuning, specifically on semantic parsing. We compare them against each other on two different semantic parsing datasets, and we also compare them against full and partial fine-tuning, both in few-shot and conventional data settings. While prefix tuning is shown to do poorly for semantic parsing tasks off the shelf, we modify it by adding special token embeddings, which results in very strong performance without compromising parameter savings.
    Geometric Transformers for Protein Interface Contact Prediction. (arXiv:2110.02423v5 [cs.LG] UPDATED)
    Computational methods for predicting the interface contacts between proteins come highly sought after for drug discovery as they can significantly advance the accuracy of alternative approaches, such as protein-protein docking, protein function analysis tools, and other computational methods for protein bioinformatics. In this work, we present the Geometric Transformer, a novel geometry-evolving graph transformer for rotation and translation-invariant protein interface contact prediction, packaged within DeepInteract, an end-to-end prediction pipeline. DeepInteract predicts partner-specific protein interface contacts (i.e., inter-protein residue-residue contacts) given the 3D tertiary structures of two proteins as input. In rigorous benchmarks, DeepInteract, on challenging protein complex targets from the 13th and 14th CASP-CAPRI experiments as well as Docking Benchmark 5, achieves 14% and 1.1% top L/5 precision (L: length of a protein unit in a complex), respectively. In doing so, DeepInteract, with the Geometric Transformer as its graph-based backbone, outperforms existing methods for interface contact prediction in addition to other graph-based neural network backbones compatible with DeepInteract, thereby validating the effectiveness of the Geometric Transformer for learning rich relational-geometric features for downstream tasks on 3D protein structures.
    Reinforcement Learning for Classical Planning: Viewing Heuristics as Dense Reward Generators. (arXiv:2109.14830v2 [cs.AI] UPDATED)
    Recent advances in reinforcement learning (RL) have led to a growing interest in applying RL to classical planning domains or applying classical planning methods to some complex RL domains. However, the long-horizon goal-based problems found in classical planning lead to sparse rewards for RL, making direct application inefficient. In this paper, we propose to leverage domain-independent heuristic functions commonly used in the classical planning literature to improve the sample efficiency of RL. These classical heuristics act as dense reward generators to alleviate the sparse-rewards issue and enable our RL agent to learn domain-specific value functions as residuals on these heuristics, making learning easier. Correct application of this technique requires consolidating the discounted metric used in RL and the non-discounted metric used in heuristics. We implement the value functions using Neural Logic Machines, a neural network architecture designed for grounded first-order logic inputs. We demonstrate on several classical planning domains that using classical heuristics for RL allows for good sample efficiency compared to sparse-reward RL. We further show that our learned value functions generalize to novel problem instances in the same domain.
    Machine Learning Simulates Agent-Based Model Towards Policy. (arXiv:2203.02576v1 [cs.MA])
    Public Policies are not intrinsically positive or negative. Rather, policies provide varying levels of effects across different recipients. Methodologically, computational modeling enables the application of a combination of multiple influences on empirical data, thus allowing for heterogeneous response to policies. We use a random forest machine learning algorithm to emulate an agent-based model (ABM) and evaluate competing policies across 46 Metropolitan Regions (MRs) in Brazil. In doing so, we use input parameters and output indicators of 11,076 actual simulation runs and one million emulated runs. As a result, we obtain the optimal (and non-optimal) performance of each region over the policies. Optimum is defined as a combination of production and inequality indicators for the full ensemble of MRs. Results suggest that MRs already have embedded structures that favor optimal or non-optimal results, but they also illustrate which policy is more beneficial to each place. In addition to providing MR-specific policies' results, the use of machine learning to simulate an ABM reduces the computational burden, whereas allowing for a much larger variation among model parameters. The coherence of results within the context of larger uncertainty -- vis-\`a-vis those of the original ABM -- suggests an additional test of robustness of the model. At the same time the exercise indicates which parameters should policymakers intervene, in order to work towards optimum of MRs.
    Active Hierarchical Exploration with Stable Subgoal Representation Learning. (arXiv:2105.14750v3 [cs.LG] UPDATED)
    Goal-conditioned hierarchical reinforcement learning (GCHRL) provides a promising approach to solving long-horizon tasks. Recently, its success has been extended to more general settings by concurrently learning hierarchical policies and subgoal representations. Although GCHRL possesses superior exploration ability by decomposing tasks via subgoals, existing GCHRL methods struggle in temporally extended tasks with sparse external rewards, since the high-level policy learning relies on external rewards. As the high-level policy selects subgoals in an online learned representation space, the dynamic change of the subgoal space severely hinders effective high-level exploration. In this paper, we propose a novel regularization that contributes to both stable and efficient subgoal representation learning. Building upon the stable representation, we design measures of novelty and potential for subgoals, and develop an active hierarchical exploration strategy that seeks out new promising subgoals and states without intrinsic rewards. Experimental results show that our approach significantly outperforms state-of-the-art baselines in continuous control tasks with sparse rewards.
    Meta Mirror Descent: Optimiser Learning for Fast Convergence. (arXiv:2203.02711v1 [cs.LG])
    Optimisers are an essential component for training machine learning models, and their design influences learning speed and generalisation. Several studies have attempted to learn more effective gradient-descent optimisers via solving a bi-level optimisation problem where generalisation error is minimised with respect to optimiser parameters. However, most existing optimiser learning methods are intuitively motivated, without clear theoretical support. We take a different perspective starting from mirror descent rather than gradient descent, and meta-learning the corresponding Bregman divergence. Within this paradigm, we formalise a novel meta-learning objective of minimising the regret bound of learning. The resulting framework, termed Meta Mirror Descent (MetaMD), learns to accelerate optimisation speed. Unlike many meta-learned optimisers, it also supports convergence and generalisation guarantees and uniquely does so without requiring validation data. We evaluate our framework on a variety of tasks and architectures in terms of convergence rate and generalisation error and demonstrate strong performance.
    Cellular Segmentation and Composition in Routine Histology Images using Deep Learning. (arXiv:2203.02510v1 [q-bio.QM])
    Identification and quantification of nuclei in colorectal cancer haematoxylin \& eosin (H\&E) stained histology images is crucial to prognosis and patient management. In computational pathology these tasks are referred to as nuclear segmentation, classification and composition and are used to extract meaningful interpretable cytological and architectural features for downstream analysis. The CoNIC challenge poses the task of automated nuclei segmentation, classification and composition into six different types of nuclei from the largest publicly known nuclei dataset - Lizard. In this regard, we have developed pipelines for the prediction of nuclei segmentation using HoVer-Net and ALBRT for cellular composition. On testing on the preliminary test set, HoVer-Net achieved a PQ of 0.58, a PQ+ of 0.58 and finally a mPQ+ of 0.35. For the prediction of cellular composition with ALBRT on the preliminary test set, we achieved an overall $R^2$ score of 0.53, consisting of 0.84 for lymphocytes, 0.70 for epithelial cells, 0.70 for plasma and .060 for eosinophils.
    ECMG: Exemplar-based Commit Message Generation. (arXiv:2203.02700v1 [cs.SE])
    Commit messages concisely describe the content of code diffs (i.e., code changes) and the intent behind them. Recently, many approaches have been proposed to generate commit messages automatically. The information retrieval-based methods reuse the commit messages of similar code diffs, while the neural-based methods learn the semantic connection between code diffs and commit messages. However, the reused commit messages might not accurately describe the content/intent of code diffs and neural-based methods tend to generate high-frequent and repetitive tokens in the corpus. In this paper, we combine the advantages of the two technical routes and propose a novel exemplar-based neural commit message generation model, which treats the similar commit message as an exemplar and leverages it to guide the neural network model to generate an accurate commit message. We perform extensive experiments and the results confirm the effectiveness of our model.
    What is a meaningful representation of protein sequences?. (arXiv:2012.02679v4 [q-bio.BM] UPDATED)
    How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.
    Concept-based Explanations for Out-Of-Distribution Detectors. (arXiv:2203.02586v1 [cs.LG])
    Out-of-distribution (OOD) detection plays a crucial role in ensuring the safe deployment of deep neural network (DNN) classifiers. While a myriad of methods have focused on improving the performance of OOD detectors, a critical gap remains in interpreting their decisions. We help bridge this gap by providing explanations for OOD detectors based on learned high-level concepts. We first propose two new metrics for assessing the effectiveness of a particular set of concepts for explaining OOD detectors: 1) detection completeness, which quantifies the sufficiency of concepts for explaining an OOD-detector's decisions, and 2) concept separability, which captures the distributional separation between in-distribution and OOD data in the concept space. Based on these metrics, we propose a framework for learning a set of concepts that satisfy the desired properties of detection completeness and concept separability and demonstrate the framework's effectiveness in providing concept-based explanations for diverse OOD techniques. We also show how to identify prominent concepts that contribute to the detection results via a modified Shapley value-based importance score.
    Safe Reinforcement Learning for Legged Locomotion. (arXiv:2203.02638v1 [cs.RO])
    Designing control policies for legged locomotion is complex due to the under-actuated and non-continuous robot dynamics. Model-free reinforcement learning provides promising tools to tackle this challenge. However, a major bottleneck of applying model-free reinforcement learning in real world is safety. In this paper, we propose a safe reinforcement learning framework that switches between a safe recovery policy that prevents the robot from entering unsafe states, and a learner policy that is optimized to complete the task. The safe recovery policy takes over the control when the learner policy violates safety constraints, and hands over the control back when there are no future safety violations. We design the safe recovery policy so that it ensures safety of legged locomotion while minimally intervening in the learning process. Furthermore, we theoretically analyze the proposed framework and provide an upper bound on the task performance. We verify the proposed framework in four locomotion tasks on a simulated and real quadrupedal robot: efficient gait, catwalk, two-leg balance, and pacing. On average, our method achieves 48.6% fewer falls and comparable or better rewards than the baseline methods in simulation. When deployed it on real-world quadruped robot, our training pipeline enables 34% improvement in energy efficiency for the efficient gait, 40.9% narrower of the feet placement in the catwalk, and two times more jumping duration in the two-leg balance. Our method achieves less than five falls over the duration of 115 minutes of hardware time.
    Machine Learning Applications in Diagnosis, Treatment and Prognosis of Lung Cancer. (arXiv:2203.02794v1 [cs.LG])
    The recent development of imaging and sequencing technologies enables systematic advances in the clinical study of lung cancer. Meanwhile, the human mind is limited in effectively handling and fully utilizing the accumulation of such enormous amounts of data. Machine learning-based approaches play a critical role in integrating and analyzing these large and complex datasets, which have extensively characterized lung cancer through the use of different perspectives from these accrued data. In this article, we provide an overview of machine learning-based approaches that strengthen the varying aspects of lung cancer diagnosis and therapy, including early detection, auxiliary diagnosis, prognosis prediction and immunotherapy practice. Moreover, we highlight the challenges and opportunities for future applications of machine learning in lung cancer.
    BoostMIS: Boosting Medical Image Semi-supervised Learning with Adaptive Pseudo Labeling and Informative Active Annotation. (arXiv:2203.02533v1 [eess.IV])
    In this paper, we propose a novel semi-supervised learning (SSL) framework named BoostMIS that combines adaptive pseudo labeling and informative active annotation to unleash the potential of medical image SSL models: (1) BoostMIS can adaptively leverage the cluster assumption and consistency regularization of the unlabeled data according to the current learning status. This strategy can adaptively generate one-hot ``hard'' labels converted from task model predictions for better task model training. (2) For the unselected unlabeled images with low confidence, we introduce an Active learning (AL) algorithm to find the informative samples as the annotation candidates by exploiting virtual adversarial perturbation and model's density-aware entropy. These informative candidates are subsequently fed into the next training cycle for better SSL label propagation. Notably, the adaptive pseudo-labeling and informative active annotation form a learning closed-loop that are mutually collaborative to boost medical image SSL. To verify the effectiveness of the proposed method, we collected a metastatic epidural spinal cord compression (MESCC) dataset that aims to optimize MESCC diagnosis and classification for improved specialist referral and treatment. We conducted an extensive experimental study of BoostMIS on MESCC and another public dataset COVIDx. The experimental results verify our framework's effectiveness and generalisability for different medical image datasets with a significant improvement over various state-of-the-art methods.
    GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation. (arXiv:2203.02923v1 [q-bio.QM])
    Predicting molecular conformations from molecular graphs is a fundamental problem in cheminformatics and drug discovery. Recently, significant progress has been achieved with machine learning approaches, especially with deep generative models. Inspired by the diffusion process in classical non-equilibrium thermodynamics where heated particles will diffuse from original states to a noise distribution, in this paper, we propose a novel generative model named GeoDiff for molecular conformation prediction. GeoDiff treats each atom as a particle and learns to directly reverse the diffusion process (i.e., transforming from a noise distribution to stable conformations) as a Markov chain. Modeling such a generation process is however very challenging as the likelihood of conformations should be roto-translational invariant. We theoretically show that Markov chains evolving with equivariant Markov kernels can induce an invariant distribution by design, and further propose building blocks for the Markov kernels to preserve the desirable equivariance property. The whole framework can be efficiently trained in an end-to-end fashion by optimizing a weighted variational lower bound to the (conditional) likelihood. Experiments on multiple benchmarks show that GeoDiff is superior or comparable to existing state-of-the-art approaches, especially on large molecules.
    Building 3D Generative Models from Minimal Data. (arXiv:2203.02554v1 [cs.CV])
    We propose a method for constructing generative models of 3D objects from a single 3D mesh and improving them through unsupervised low-shot learning from 2D images. Our method produces a 3D morphable model that represents shape and albedo in terms of Gaussian processes. Whereas previous approaches have typically built 3D morphable models from multiple high-quality 3D scans through principal component analysis, we build 3D morphable models from a single scan or template. As we demonstrate in the face domain, these models can be used to infer 3D reconstructions from 2D data (inverse graphics) or 3D data (registration). Specifically, we show that our approach can be used to perform face recognition using only a single 3D template (one scan total, not one per person). We extend our model to a preliminary unsupervised learning framework that enables the learning of the distribution of 3D faces using one 3D template and a small number of 2D images. This approach could also provide a model for the origins of face perception in human infants, who appear to start with an innate face template and subsequently develop a flexible system for perceiving the 3D structure of any novel face from experience with only 2D images of a relatively small number of familiar faces.
    Target Network and Truncation Overcome The Deadly triad in $Q$-Learning. (arXiv:2203.02628v1 [cs.LG])
    $Q$-learning with function approximation is one of the most empirically successful while theoretically mysterious reinforcement learning (RL) algorithms, and was identified in Sutton (1999) as one of the most important theoretical open problems in the RL community. Even in the basic linear function approximation setting, there are well-known divergent examples. In this work, we propose a stable design for $Q$-learning with linear function approximation using target network and truncation, and establish its finite-sample guarantees. Our result implies an $\mathcal{O}(\epsilon^{-2})$ sample complexity up to a function approximation error. This is the first variant of $Q$-learning with linear function approximation that is provably stable without requiring strong assumptions or modifying the problem parameters, and achieves the optimal sample complexity.
    Geodesic Gramian Denoising Applied to the Images Contaminated With Noise Sampled From Diverse Probability Distributions. (arXiv:2203.02600v1 [eess.IV])
    As quotidian use of sophisticated cameras surges, people in modern society are more interested in capturing fine-quality images. However, the quality of the images might be inferior to people's expectations due to the noise contamination in the images. Thus, filtering out the noise while preserving vital image features is an essential requirement. Current existing denoising methods have their own assumptions on the probability distribution in which the contaminated noise is sampled for the method to attain its expected denoising performance. In this paper, we utilize our recent Gramian-based filtering scheme to remove noise sampled from five prominent probability distributions from selected images. This method preserves image smoothness by adopting patches partitioned from the image, rather than pixels, and retains vital image features by performing denoising on the manifold underlying the patch space rather than in the image domain. We validate its denoising performance, using three benchmark computer vision test images applied to two state-of-the-art denoising methods, namely BM3D and K-SVD.
    Enabling Automated Machine Learning for Model-Driven AI Engineering. (arXiv:2203.02927v1 [cs.SE])
    Developing smart software services requires both Software Engineering and Artificial Intelligence (AI) skills. AI practitioners, such as data scientists often focus on the AI side, for example, creating and training Machine Learning (ML) models given a specific use case and data. They are typically not concerned with the entire software development life-cycle, architectural decisions for the system and performance issues beyond the predictive ML models (e.g., regarding the security, privacy, throughput, scalability, availability, as well as ethical, legal and regulatory compliance). In this manuscript, we propose a novel approach to enable Model-Driven Software Engineering and Model-Driven AI Engineering. In particular, we support Automated ML, thus assisting software engineers without deep AI knowledge in developing AI-intensive systems by choosing the most appropriate ML model, algorithm and techniques with suitable hyper-parameters for the task at hand. To validate our work, we carry out a case study in the smart energy domain.
    Fuzzy Forests For Feature Selection in High-Dimensional Survey Data: An Application to the 2020 U.S. Presidential Election. (arXiv:2203.02818v1 [cs.LG])
    An increasingly common methodological issue in the field of social science is high-dimensional and highly correlated datasets that are unamenable to the traditional deductive framework of study. Analysis of candidate choice in the 2020 Presidential Election is one area in which this issue presents itself: in order to test the many theories explaining the outcome of the election, it is necessary to use data such as the 2020 Cooperative Election Study Common Content, with hundreds of highly correlated features. We present the Fuzzy Forests algorithm, a variant of the popular Random Forests ensemble method, as an efficient way to reduce the feature space in such cases with minimal bias, while also maintaining predictive performance on par with common algorithms like Random Forests and logit. Using Fuzzy Forests, we isolate the top correlates of candidate choice and find that partisan polarization was the strongest factor driving the 2020 presidential election.
    Acceleration of Federated Learning with Alleviated Forgetting in Local Training. (arXiv:2203.02645v1 [cs.LG])
    Federated learning (FL) enables distributed optimization of machine learning models while protecting privacy by independently training local models on each client and then aggregating parameters on a central server, thereby producing an effective global model. Although a variety of FL algorithms have been proposed, their training efficiency remains low when the data are not independently and identically distributed (non-i.i.d.) across different clients. We observe that the slow convergence rates of the existing methods are (at least partially) caused by the catastrophic forgetting issue during the local training stage on each individual client, which leads to a large increase in the loss function concerning the previous training data at the other clients. Here, we propose FedReg, an algorithm to accelerate FL with alleviated knowledge forgetting in the local training stage by regularizing locally trained parameters with the loss on generated pseudo data, which encode the knowledge of previous training data learned by the global model. Our comprehensive experiments demonstrate that FedReg not only significantly improves the convergence rate of FL, especially when the neural network architecture is deep and the clients' data are extremely non-i.i.d., but is also able to protect privacy better in classification problems and more robust against gradient inversion attacks. The code is available at: https://github.com/Zoesgithub/FedReg.
    An Operator Theoretic Perspective on Pruning Deep Neural Networks. (arXiv:2110.14856v2 [cs.LG] UPDATED)
    The discovery of sparse subnetworks that are able to perform as well as full models has found broad applied and theoretical interest. While many pruning methods have been developed to this end, the na\"ive approach of removing parameters based on their magnitude has been found to be as robust as more complex, state-of-the-art algorithms. The lack of theory behind magnitude pruning's success, especially pre-convergence, and its relation to other pruning methods, such as gradient based pruning, are outstanding open questions in the field that are in need of being addressed. We make use of recent advances in dynamical systems theory, namely Koopman operator theory, to define a new class of theoretically motivated pruning algorithms. We show that these algorithms can be equivalent to magnitude and gradient based pruning, unifying these seemingly disparate methods, and find that they can be used to shed light on magnitude pruning's performance during the early part of training.
    Watch from sky: machine-learning-based multi-UAV network for predictive police surveillance. (arXiv:2203.02892v1 [cs.LG])
    This paper presents the watch-from-sky framework, where multiple unmanned aerial vehicles (UAVs) play four roles, i.e., sensing, data forwarding, computing, and patrolling, for predictive police surveillance. Our framework is promising for crime deterrence because UAVs are useful for collecting and distributing data and have high mobility. Our framework relies on machine learning (ML) technology for controlling and dispatching UAVs and predicting crimes. This paper compares the conceptual model of our framework against the literature. It also reports a simulation of UAV dispatching using reinforcement learning and distributed ML inference over a lossy UAV network.
    Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python. (arXiv:2112.08429v2 [cs.LG] UPDATED)
    Modern deep learning frameworks provide imperative, eager execution programming interfaces embedded in Python to provide a productive development experience. However, deep learning practitioners sometimes need to capture and transform program structure for performance optimization, visualization, analysis, and hardware integration. We study the different designs for program capture and transformation used in deep learning. By designing for typical deep learning use cases rather than long tail ones, it is possible to create a simpler framework for program capture and transformation. We apply this principle in torch.fx, a program capture and transformation library for PyTorch written entirely in Python and optimized for high developer productivity by ML practitioners. We present case studies showing how torch.fx enables workflows previously inaccessible in the PyTorch ecosystem.
    How to Train Unstable Looped Tensor Network. (arXiv:2203.02617v1 [cs.LG])
    A rising problem in the compression of Deep Neural Networks is how to reduce the number of parameters in convolutional kernels and the complexity of these layers by low-rank tensor approximation. Canonical polyadic tensor decomposition (CPD) and Tucker tensor decomposition (TKD) are two solutions to this problem and provide promising results. However, CPD often fails due to degeneracy, making the networks unstable and hard to fine-tune. TKD does not provide much compression if the core tensor is big. This motivates using a hybrid model of CPD and TKD, a decomposition with multiple Tucker models with small core tensor, known as block term decomposition (BTD). This paper proposes a more compact model that further compresses the BTD by enforcing core tensors in BTD identical. We establish a link between the BTD with shared parameters and a looped chain tensor network (TC). Unfortunately, such strongly constrained tensor networks (with loop) encounter severe numerical instability, as proved by y (Landsberg, 2012) and (Handschuh, 2015a). We study perturbation of chain tensor networks, provide interpretation of instability in TC, demonstrate the problem. We propose novel methods to gain the stability of the decomposition results, keep the network robust and attain better approximation. Experimental results will confirm the superiority of the proposed methods in compression of well-known CNNs, and TC decomposition under challenging scenarios
    Ensemble Knowledge Guided Sub-network Search and Fine-tuning for Filter Pruning. (arXiv:2203.02651v1 [cs.LG])
    Conventional NAS-based pruning algorithms aim to find the sub-network with the best validation performance. However, validation performance does not successfully represent test performance, i.e., potential performance. Also, although fine-tuning the pruned network to restore the performance drop is an inevitable process, few studies have handled this issue. This paper proposes a novel sub-network search and fine-tuning method that is named Ensemble Knowledge Guidance (EKG). First, we experimentally prove that the fluctuation of the loss landscape is an effective metric to evaluate the potential performance. In order to search a sub-network with the smoothest loss landscape at a low cost, we propose a pseudo-supernet built by an ensemble sub-network knowledge distillation. Next, we propose a novel fine-tuning that re-uses the information of the search phase. We store the interim sub-networks, that is, the by-products of the search phase, and transfer their knowledge into the pruned network. Note that EKG is easy to be plugged-in and computationally efficient. For example, in the case of ResNet-50, about 45% of FLOPS is removed without any performance drop in only 315 GPU hours. The implemented code is available at https://github.com/sseung0703/EKG.
    Sparsity-Inducing Categorical Prior Improves Robustness of the Information Bottleneck. (arXiv:2203.02592v1 [stat.ML])
    The information bottleneck framework provides a systematic approach to learn representations that compress nuisance information in inputs and extract semantically meaningful information about the predictions. However, the choice of the prior distribution that fix the dimensionality across all the data can restrict the flexibility of this approach to learn robust representations. We present a novel sparsity-inducing spike-slab prior that uses sparsity as a mechanism to provide flexibility that allows each data point to learn its own dimension distribution. In addition, it provides a mechanism to learn a joint distribution of the latent variable and the sparsity. Thus, unlike other approaches, it can account for the full uncertainty in the latent space. Through a series of experiments using in-distribution and out-of-distribution learning scenarios on the MNIST and Fashion-MNIST data we show that the proposed approach improves the accuracy and robustness compared with the traditional fixed -imensional priors as well as other sparsity-induction mechanisms proposed in the literature.
    Deep Partial Multiplex Network Embedding. (arXiv:2203.02656v1 [cs.LG])
    Network embedding is an effective technique to learn the low-dimensional representations of nodes in networks. Real-world networks are usually with multiplex or having multi-view representations from different relations. Recently, there has been increasing interest in network embedding on multiplex data. However, most existing multiplex approaches assume that the data is complete in all views. But in real applications, it is often the case that each view suffers from the missing of some data and therefore results in partial multiplex data. In this paper, we present a novel Deep Partial Multiplex Network Embedding approach to deal with incomplete data. In particular, the network embeddings are learned by simultaneously minimizing the deep reconstruction loss with the autoencoder neural network, enforcing the data consistency across views via common latent subspace learning, and preserving the data topological structure within the same network through graph Laplacian. We further prove the orthogonal invariant property of the learned embeddings and connect our approach with the binary embedding techniques. Experiments on four multiplex benchmarks demonstrate the superior performance of the proposed approach over several state-of-the-art methods on node classification, link prediction and clustering tasks.
    Non-linear predictive vector quantization of speech. (arXiv:2203.02506v1 [cs.LG])
    In this paper we propose a Non-Linear Predictive Vector quantizer (PVQ) for speech coding, based on Multi-Layer Perceptrons. We also propose a method to evaluate if a quantizer is well designed, and if it exploits the correlation between consecutive outputs. Although the results of the Non-linear PVQ do not improve the results of the non-linear scalar predictor, we check that there is some room for the PVQ improvement.
    Fast Computation of Generalized Eigenvectors for Manifold Graph Embedding. (arXiv:2112.07862v2 [eess.SP] UPDATED)
    Our goal is to efficiently compute low-dimensional latent coordinates for nodes in an input graph -- known as graph embedding -- for subsequent data processing such as clustering. Focusing on finite graphs that are interpreted as uniform samples on continuous manifolds (called manifold graphs), we leverage existing fast extreme eigenvector computation algorithms for speedy execution. We first pose a generalized eigenvalue problem for sparse matrix pair $(\A,\B)$, where $\A = \L - \mu \Q + \epsilon \I$ is a sum of graph Laplacian $\L$ and disconnected two-hop difference matrix $\Q$. Eigenvector $\v$ minimizing Rayleigh quotient $\frac{\v^{\top} \A \v}{\v^{\top} \v}$ thus minimizes $1$-hop neighbor distances while maximizing distances between disconnected $2$-hop neighbors, preserving graph structure. Matrix $\B = \text{diag}(\{\b_i\})$ that defines eigenvector orthogonality is then chosen so that boundary / interior nodes in the sampling domain have the same generalized degrees. $K$-dimensional latent vectors for the $N$ graph nodes are the first $K$ generalized eigenvectors for $(\A,\B)$, computed in $\cO(N)$ using LOBPCG, where $K \ll N$. Experiments show that our embedding is among the fastest in the literature, while producing the best clustering performance for manifold graphs.
    A Similarity-based Framework for Classification Task. (arXiv:2203.02669v1 [cs.LG])
    Similarity-based method gives rise to a new class of methods for multi-label learning and also achieves promising performance. In this paper, we generalize this method, resulting in a new framework for classification task. Specifically, we unite similarity-based learning and generalized linear models to achieve the best of both worlds. This allows us to capture interdependencies between classes and prevent from impairing performance of noisy classes. Each learned parameter of the model can reveal the contribution of one class to another, providing interpretability to some extent. Experiment results show the effectiveness of the proposed approach on multi-class and multi-label datasets
    Important Object Identification with Semi-Supervised Learning for Autonomous Driving. (arXiv:2203.02634v1 [cs.CV])
    Accurate identification of important objects in the scene is a prerequisite for safe and high-quality decision making and motion planning of intelligent agents (e.g., autonomous vehicles) that navigate in complex and dynamic environments. Most existing approaches attempt to employ attention mechanisms to learn importance weights associated with each object indirectly via various tasks (e.g., trajectory prediction), which do not enforce direct supervision on the importance estimation. In contrast, we tackle this task in an explicit way and formulate it as a binary classification ("important" or "unimportant") problem. We propose a novel approach for important object identification in egocentric driving scenarios with relational reasoning on the objects in the scene. Besides, since human annotations are limited and expensive to obtain, we present a semi-supervised learning pipeline to enable the model to learn from unlimited unlabeled data. Moreover, we propose to leverage the auxiliary tasks of ego vehicle behavior prediction to further improve the accuracy of importance estimation. The proposed approach is evaluated on a public egocentric driving dataset (H3D) collected in complex traffic scenarios. A detailed ablative study is conducted to demonstrate the effectiveness of each model component and the training strategy. Our approach also outperforms rule-based baselines by a large margin.
    Evolving symbolic density functionals. (arXiv:2203.02540v1 [cs.NE])
    Systematic development of accurate density functionals has been a decades-long challenge for scientists. Despite the emerging application of machine learning (ML) in approximating functionals, the resulting ML functionals usually contain more than tens of thousands parameters, which makes a huge gap in the formulation with the conventional human-designed symbolic functionals. We propose a new framework, Symbolic Functional Evolutionary Search (SyFES), that automatically constructs accurate functionals in the symbolic form, which is more explainable to humans, cheaper to evaluate, and easier to integrate to existing density functional theory codes than other ML functionals. We first show that without prior knowledge, SyFES reconstructed a known functional from scratch. We then demonstrate that evolving from an existing functional $\omega$B97M-V, SyFES found a new functional, GAS22 (Google Accelerated Science 22), that performs better on main-group chemistry. Our framework opens a new direction in leveraging computing power for the systematic development of symbolic density functionals.
    Style-ERD: Responsive and Coherent Online Motion Style Transfer. (arXiv:2203.02574v1 [cs.CV])
    Motion style transfer is a common method for enriching character animation. Motion style transfer algorithms are often designed for offline settings where motions are processed in segments. However, for online animation applications, such as realtime avatar animation from motion capture, motions need to be processed as a stream with minimal latency. In this work, we realize a flexible, high-quality motion style transfer method for this setting. We propose a novel style transfer model, Style-ERD, to stylize motions in an online manner with an Encoder-Recurrent-Decoder structure, along with a novel discriminator that combines feature attention and temporal attention. Our method stylizes motions into multiple target styles with a unified model. Although our method targets online settings, it outperforms previous offline methods in motion realism and style expressiveness and provides significant gains in runtime efficiency
  • Open

    High-Resolution Peak Demand Estimation Using Generalized Additive Models and Deep Neural Networks. (arXiv:2203.03342v1 [cs.LG])
    This paper presents a method for estimating high-resolution electricity peak demand given lower resolution data. The technique won a data competition organized by the British distribution network operator Western Power Distribution. The exercise was to estimate the minimum and maximum load values in a single substation in a one-minute resolution as precisely as possible. In contrast, the data was given in half-hourly and hourly resolutions. The winning method combines generalized additive models (GAM) and deep artificial neural networks (DNN) which are popular in load forecasting. We provide an extensive analysis of the prediction models, including the importance of input parameters with a focus on load, weather, and seasonal effects. In addition, we provide a rigorous evaluation study that goes beyond the competition frame to analyze the robustness. The results show that the proposed methods are superior, not only in the single competition month but also in the meaningful evaluation study.
    Smoothing with the Best Rectangle Window is Optimal for All Tapered Rectangle Windows. (arXiv:2203.02997v1 [stat.ML])
    We investigate the optimal selection of weight windows for the problem of weighted least squares. We show that weight windows should be symmetric around its center, which is also its peak. We consider the class of tapered rectangle window weights, which are nonincreasing away from the center. We show that the best rectangle window is optimal for such window definitions. We also extend our results to the least absolutes and more general case of arbitrary loss functions to find similar results.
    Anomaly Detection by Recombining Gated Unsupervised Experts. (arXiv:2008.13763v4 [cs.LG] UPDATED)
    Anomaly detection has been considered under several extents of prior knowledge. Unsupervised methods do not require any labelled data, whereas semi-supervised methods leverage some known anomalies. Inspired by mixture-of-experts models and the analysis of the hidden activations of neural networks, we introduce a novel data-driven anomaly detection method called ARGUE. Our method is not only applicable to unsupervised and semi-supervised environments, but also profits from prior knowledge of self-supervised settings. We designed ARGUE as a combination of dedicated expert networks, which specialise on parts of the input data. For its final decision, ARGUE fuses the distributed knowledge across the expert systems using a gated mixture-of-experts architecture. Our evaluation motivates that prior knowledge about the normal data distribution may be as valuable as known anomalies.
    Discovering Inductive Bias with Gibbs Priors: A Diagnostic Tool for Approximate Bayesian Inference. (arXiv:2203.03353v1 [stat.ML])
    Full Bayesian posteriors are rarely analytically tractable, which is why real-world Bayesian inference heavily relies on approximate techniques. Approximations generally differ from the true posterior and require diagnostic tools to assess whether the inference can still be trusted. We investigate a new approach to diagnosing approximate inference: the approximation mismatch is attributed to a change in the inductive bias by treating the approximations as exact and reverse-engineering the corresponding prior. We show that the problem is more complicated than it appears to be at first glance, because the solution generally depends on the observation. By reframing the problem in terms of incompatible conditional distributions we arrive at a natural solution: the Gibbs prior. The resulting diagnostic is based on pseudo-Gibbs sampling, which is widely applicable and easy to implement. We illustrate how the Gibbs prior can be used to discover the inductive bias in a controlled Gaussian setting and for a variety of Bayesian models and approximations.
    On the Last Iterate Convergence of Momentum Methods. (arXiv:2102.07002v2 [cs.LG] UPDATED)
    SGD with Momentum (SGDM) is a widely used family of algorithms for large-scale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $\Omega(\frac{\ln T}{\sqrt{T}})$ after $T$ iterations. Based on this fact, we study a class of (both adaptive and non-adaptive) Follow-The-Regularized-Leader-based SGDM algorithms with \emph{increasing momentum} and \emph{shrinking updates}. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex stochastic optimization problems without projections onto bounded domains nor knowledge of $T$. Further, we show a variety of results for FTRL-based SGDM when used with adaptive stepsizes. Empirical results are shown as well.
    Learning Stochastic Shortest Path with Linear Function Approximation. (arXiv:2110.12727v2 [cs.LG] UPDATED)
    We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems as linear mixture SSPs. We propose a novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which can attain a $\tilde{\mathcal{O}}(dB_*^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is the number of episodes, $d$ is the dimension of the feature mapping in the mixture model, $B_*$ bounds the expected cumulative cost of the optimal policy, and $c_{\min}>0$ is the lower bound of the cost function. Our algorithm also applies to the case when $c_{\min} = 0$, and a $\tilde{\mathcal{O}}(K^{2/3})$ regret is guaranteed. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSP. Moreover, we design a refined Bernstein-type confidence set and propose an improved algorithm, which provably achieves a $\tilde{\mathcal{O}}(dB_*\sqrt{K/c_{\min}})$ regret. In complement to the regret upper bounds, we also prove a lower bound of $\Omega(dB_* \sqrt{K})$. Hence, our improved algorithm matches the lower bound up to a $1/\sqrt{c_{\min}}$ factor and poly-logarithmic factors, achieving a near-optimal regret guarantee.
    Machine learning using longitudinal prescription and medical claims for the detection of nonalcoholic steatohepatitis (NASH). (arXiv:2203.03365v1 [stat.ML])
    Objectives To develop and evaluate machine learning models to detect suspected undiagnosed nonalcoholic steatohepatitis (NASH) patients for diagnostic screening and clinical management. Methods In this retrospective observational noninterventional study using administrative medical claims data from 1,463,089 patients, gradient-boosted decision trees were trained to detect likely NASH patients from an at-risk patient population with a history of obesity, type 2 diabetes mellitus (T2DM), metabolic disorder, or nonalcoholic fatty liver (NAFL). Models were trained to detect likely NASH in all at-risk patients or in the subset without a prior NAFL diagnosis (non-NAFL at-risk patients). Models were trained and validated using retrospective medical claims data and assessed using area under precision recall and receiver operating characteristic curves (AUPRCs, AUROCs). Results The 6-month incidence of NASH in claims data was 1 per 1,437 at-risk patients and 1 per 2,127 non-NAFL at-risk patients. The model trained to detect NASH in all at-risk patients had an AUPRC of 0.0107 (95% CI 0.0104 - 0.011) and an AUROC of 0.84. At 10% recall, model precision was 4.3%, which is 60x above NASH incidence. The model trained to detect NASH in non-NAFL patients had an AUPRC of 0.003 (95% CI 0.0029 - 0.0031) and an AUROC of 0.78. At 10% recall, model precision was 1%, which is 20x above NASH incidence. Conclusion The low incidence of NASH in medical claims data corroborates the pattern of NASH underdiagnosis in clinical practice. Claims-based machine learning could facilitate the detection of probable NASH patients for diagnostic testing and disease management.
    Improved rates for prediction and identification of partially observed linear dynamical systems. (arXiv:2011.10006v3 [cs.LG] UPDATED)
    Identification of a linear time-invariant dynamical system from partial observations is a fundamental problem in control theory. Particularly challenging are systems exhibiting long-term memory. A natural question is how learn such systems with non-asymptotic statistical rates depending on the inherent dimensionality (order) $d$ of the system, rather than on the possibly much larger memory length. We propose an algorithm that given a single trajectory of length $T$ with gaussian observation noise, learns the system with a near-optimal rate of $\widetilde O\left(\sqrt\frac{d}{T}\right)$ in $\mathcal{H}_2$ error, with only logarithmic, rather than polynomial dependence on memory length. We also give bounds under process noise and improved bounds for learning a realization of the system. Our algorithm is based on multi-scale low-rank approximation: SVD applied to Hankel matrices of geometrically increasing sizes. Our analysis relies on careful application of concentration bounds on the Fourier domain -- we give sharper concentration bounds for sample covariance of correlated inputs and for $\mathcal H_\infty$ norm estimation, which may be of independent interest.
    Learning Optimal Conformal Classifiers. (arXiv:2110.09192v2 [cs.LG] UPDATED)
    Modern deep learning based classifiers show very high accuracy on test data but this does not provide sufficient guarantees for safe deployment, especially in high-stake AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's predictions, e.g., its probability estimates, to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. Compared to standard training, ConfTr reduces the average confidence set size (inefficiency) of state-of-the-art CP methods applied after training. Moreover, it allows to "shape" the confidence sets predicted at test time, which is difficult for standard CP. On experiments with several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.
    Covariate-Balancing-Aware Interpretable Deep Learning models for Treatment Effect Estimation. (arXiv:2203.03185v1 [stat.ML])
    Estimating treatment effects is of great importance for many biomedical applications with observational data. Particularly, interpretability of the treatment effects is preferable to many biomedical researchers. In this paper, we first give a theoretical analysis and propose an upper bound for the bias of average treatment effect estimation under the strong ignorability assumption. The proposed upper bound consists of two parts: training error for factual outcomes, and the distance between treated and control distributions. We use the Weighted Energy Distance (WED) to measure the distance between two distributions. Motivated by the theoretical analysis, we implement this upper bound as an objective function being minimized by leveraging a novel additive neural network architecture, which combines the expressivity of deep neural network, the interpretability of generalized additive model, the sufficiency of the balancing score for estimation adjustment, and covariate balancing properties of treated and control distributions, for estimating average treatment effects from observational data. Furthermore, we impose a so-called weighted regularization procedure based on non-parametric theory, to obtain some desirable asymptotic properties. The proposed method is illustrated by re-examining the benchmark datasets for causal inference, and it outperforms the state-of-art.
    SurvSet: An open-source time-to-event dataset repository. (arXiv:2203.03094v1 [stat.ML])
    Time-to-event (T2E) analysis is a branch of statistics that models the duration of time it takes for an event to occur. Such events can include outcomes like death, unemployment, or product failure. Most modern machine learning (ML) algorithms, like decision trees and kernel methods, are supported for T2E modelling with data science software (python and R). To complement these developments, SurvSet is the first open-source T2E dataset repository designed for a rapid benchmarking of ML algorithms and statistical methods. The data in SurvSet have been consistently formatted so that a single preprocessing method will work for all datasets. SurvSet currently has 76 datasets which vary in dimensionality, time dependency, and background (the majority of which come from biomedicine). SurvSet is available on PyPI and can be installed with pip install SurvSet. R users can download the data directly from the corresponding git repository.
    Minimax Kernel Machine Learning for a Class of Doubly Robust Functionals with Application to Proximal Causal Inference. (arXiv:2104.02929v3 [stat.ML] UPDATED)
    Robins et al. (2008) introduced a class of influence functions (IFs) which could be used to obtain doubly robust moment functions for the corresponding parameters. However, that class does not include the IF of parameters for which the nuisance functions are solutions to integral equations. Such parameters are particularly important in the field of causal inference, specifically in the recently proposed proximal causal inference framework of Tchetgen Tchetgen et al. (2020), which allows for estimating the causal effect in the presence of latent confounders. In this paper, we first extend the class of Robins et al. to include doubly robust IFs in which the nuisance functions are solutions to integral equations. Then we demonstrate that the double robustness property of these IFs can be leveraged to construct estimating equations for the nuisance functions, which enables us to solve the integral equations without resorting to parametric models. We frame the estimation of the nuisance functions as a minimax optimization problem. We provide convergence rates for the nuisance functions and conditions required for asymptotic linearity of the estimator of the parameter of interest. The experiment results demonstrate that our proposed methodology leads to robust and high-performance estimators for average causal effect in the proximal causal inference framework.
    A Local Method for Identifying Causal Relations under Markov Equivalence. (arXiv:2102.12685v2 [stat.ML] UPDATED)
    Causality is important for designing interpretable and robust methods in artificial intelligence research. We propose a local approach to identify whether a variable is a cause of a given target under the framework of causal graphical models of directed acyclic graphs (DAGs). In general, the causal relation between two variables may not be identifiable from observational data as many causal DAGs encoding different causal relations are Markov equivalent. In this paper, we first introduce a sufficient and necessary graphical condition to check the existence of a causal path from a variable to a target in every Markov equivalent DAG. Next, we provide local criteria for identifying whether a variable is a cause/non-cause of a target based only on the local structure instead of the entire graph. Finally, we propose a local learning algorithm for this causal query via learning the local structure of the variable and some additional statistical independence tests related to the target. Simulation studies show that our local algorithm is efficient and effective, compared with other state-of-art methods.
    One Objective for All Models -- Self-supervised Learning for Topic Models. (arXiv:2203.03539v1 [cs.CL])
    Self-supervised learning has significantly improved the performance of many NLP tasks. In this paper, we highlight a key advantage of self-supervised learning -- when applied to data generated by topic models, self-supervised learning can be oblivious to the specific model, and hence is less susceptible to model misspecification. In particular, we prove that commonly used self-supervised objectives based on reconstruction or contrastive samples can both recover useful posterior information for general topic models. Empirically, we show that the same objectives can perform competitively against posterior inference using the correct model, while outperforming posterior inference using misspecified model.
    Fast rates for noisy interpolation require rethinking the effects of inductive bias. (arXiv:2203.03597v1 [stat.ML])
    Good generalization performance on high-dimensional data crucially hinges on a simple structure of the ground truth and a corresponding strong inductive bias of the estimator. Even though this intuition is valid for regularized models, in this paper we caution against a strong inductive bias for interpolation in the presence of noise: Our results suggest that, while a stronger inductive bias encourages a simpler structure that is more aligned with the ground truth, it also increases the detrimental effect of noise. Specifically, for both linear regression and classification with a sparse ground truth, we prove that minimum $\ell_p$-norm and maximum $\ell_p$-margin interpolators achieve fast polynomial rates up to order $1/n$ for $p > 1$ compared to a logarithmic rate for $p = 1$. Finally, we provide experimental evidence that this trade-off may also play a crucial role in understanding non-linear interpolating models used in practice.
    Scale Mixtures of Neural Network Gaussian Processes. (arXiv:2107.01408v2 [stat.ML] UPDATED)
    Recent works have revealed that infinitely-wide feed-forward or recurrent neural networks of any architecture correspond to Gaussian processes referred to as Neural Network Gaussian Processes (NNGPs). While these works have extended the class of neural networks converging to Gaussian processes significantly, however, there has been little focus on broadening the class of stochastic processes that such neural networks converge to. In this work, inspired by the scale mixture of Gaussian random variables, we propose the scale mixture of NNGPs for which we introduce a prior distribution on the scale of the last-layer parameters. We show that simply introducing a scale prior on the last-layer parameters can turn infinitely-wide neural networks of any architecture into a richer class of stochastic processes. With certain scale priors, we obtain heavy-tailed stochastic processes, and in the case of inverse gamma priors, we recover Student's $t$ processes. We further analyze the distributions of the neural networks initialized with our prior setting and trained with gradient descents and obtain similar results as for NNGPs. We present a practical posterior-inference algorithm for the scale mixture of NNGPs and empirically demonstrate its usefulness on regression and classification tasks. In particular, we show that in both tasks, the heavy-tailed stochastic processes obtained from our framework are robust to out-of-distribution data.
    Is Bayesian Model-Agnostic Meta Learning Better than Model-Agnostic Meta Learning, Provably?. (arXiv:2203.03059v1 [cs.LG])
    Meta learning aims at learning a model that can quickly adapt to unseen tasks. Widely used meta learning methods include model agnostic meta learning (MAML), implicit MAML, Bayesian MAML. Thanks to its ability of modeling uncertainty, Bayesian MAML often has advantageous empirical performance. However, the theoretical understanding of Bayesian MAML is still limited, especially on questions such as if and when Bayesian MAML has provably better performance than MAML. In this paper, we aim to provide theoretical justifications for Bayesian MAML's advantageous performance by comparing the meta test risks of MAML and Bayesian MAML. In the meta linear regression, under both the distribution agnostic and linear centroid cases, we have established that Bayesian MAML indeed has provably lower meta test risks than MAML. We verify our theoretical results through experiments.
    Estimation and Model Misspecification: Fake and Missing Features. (arXiv:2203.03398v1 [eess.SP])
    We consider estimation under model misspecification where there is a model mismatch between the underlying system, which generates the data, and the model used during estimation. We propose a model misspecification framework which enables a joint treatment of the model misspecification types of having fake and missing features, as well as incorrect covariance assumptions on the unknowns and the noise. Here, features which are included in the model but are not present in the underlying system, and features which are not included in the model but are present in the underlying system, are referred to as fake and missing features, respectively. Under this framework, we characterize the estimation performance and reveal trade-offs between the missing and fake features and the possibly incorrect noise level assumption. In contrast to existing work focusing on incorrect covariance assumptions or missing features, fake features is a central component of our framework. Our results show that fake features can significantly improve the estimation performance, even though they are not correlated with the features in the underlying system. In particular, we show that the estimation error can be decreased by including more fake features in the model, even to the point where the model is overparametrized, i.e., the model contains more unknowns than observations.
    Learning Neural Causal Models with Active Interventions. (arXiv:2109.02429v2 [stat.ML] UPDATED)
    Discovering causal structures from data is a challenging inference problem of fundamental importance in all areas of science. The appealing properties of neural networks have recently led to a surge of interest in differentiable neural network-based methods for learning causal structures from data. So far, differentiable causal discovery has focused on static datasets of observational or fixed interventional origin. In this work, we introduce an active intervention targeting (AIT) method which enables a quick identification of the underlying causal structure of the data-generating process. Our method significantly reduces the required number of interactions compared with random intervention targeting and is applicable for both discrete and continuous optimization formulations of learning the underlying directed acyclic graph (DAG) from data. We examine the proposed method across multiple frameworks in a wide range of settings and demonstrate superior performance on multiple benchmarks from simulated to real-world data.
    On the Fairness of Causal Algorithmic Recourse. (arXiv:2010.06529v5 [cs.LG] UPDATED)
    Algorithmic fairness is typically studied from the perspective of predictions. Instead, here we investigate fairness from the perspective of recourse actions suggested to individuals to remedy an unfavourable classification. We propose two new fairness criteria at the group and individual level, which -- unlike prior work on equalising the average group-wise distance from the decision boundary -- explicitly account for causal relationships between features, thereby capturing downstream effects of recourse actions performed in the physical world. We explore how our criteria relate to others, such as counterfactual fairness, and show that fairness of recourse is complementary to fairness of prediction. We study theoretically and empirically how to enforce fair causal recourse by altering the classifier and perform a case study on the Adult dataset. Finally, we discuss whether fairness violations in the data generating process revealed by our criteria may be better addressed by societal interventions as opposed to constraints on the classifier.
    Predicting Bearings' Degradation Stages for Predictive Maintenance in the Pharmaceutical Industry. (arXiv:2203.03259v1 [stat.ML])
    In the pharmaceutical industry, the maintenance of production machines must be audited by the regulator. In this context, the problem of predictive maintenance is not when to maintain a machine, but what parts to maintain at a given point in time. The focus shifts from the entire machine to its component parts and prediction becomes a classification problem. In this paper, we focus on rolling-elements bearings and we propose a framework for predicting their degradation stages automatically. Our main contribution is a k-means bearing lifetime segmentation method based on high-frequency bearing vibration signal embedded in a latent low-dimensional subspace using an AutoEncoder. Given high-frequency vibration data, our framework generates a labeled dataset that is used to train a supervised model for bearing degradation stage detection. Our experimental results, based on the FEMTO Bearing dataset, show that our framework is scalable and that it provides reliable and actionable predictions for a range of different bearings.
    P3GM: Private High-Dimensional Data Release via Privacy Preserving Phased Generative Model. (arXiv:2006.12101v4 [cs.LG] UPDATED)
    How can we release a massive volume of sensitive data while mitigating privacy risks? Privacy-preserving data synthesis enables the data holder to outsource analytical tasks to an untrusted third party. The state-of-the-art approach for this problem is to build a generative model under differential privacy, which offers a rigorous privacy guarantee. However, the existing method cannot adequately handle high dimensional data. In particular, when the input dataset contains a large number of features, the existing techniques require injecting a prohibitive amount of noise to satisfy differential privacy, which results in the outsourced data analysis meaningless. To address the above issue, this paper proposes privacy-preserving phased generative model (P3GM), which is a differentially private generative model for releasing such sensitive data. P3GM employs the two-phase learning process to make it robust against the noise, and to increase learning efficiency (e.g., easy to converge). We give theoretical analyses about the learning complexity and privacy loss in P3GM. We further experimentally evaluate our proposed method and demonstrate that P3GM significantly outperforms existing solutions. Compared with the state-of-the-art methods, our generated samples look fewer noises and closer to the original data in terms of data diversity. Besides, in several data mining tasks with synthesized data, our model outperforms the competitors in terms of accuracy.
    E-detectors: a nonparametric framework for online changepoint detection. (arXiv:2203.03532v1 [stat.ME])
    Sequential changepoint detection is a classical problem with a variety of applications. However, the majority of prior work has been parametric, for example, focusing on exponential families. We develop a fundamentally new and general framework for changepoint detection when the pre- and post-change distributions are nonparametrically specified (and thus composite). Our procedures come with clean, nonasymptotic bounds on the average run length (frequency of false alarms). In certain nonparametric cases (like sub-Gaussian or sub-exponential), we also provide near-optimal bounds on the detection delay following a changepoint. The primary technical tool that we introduce is called an e-detector, which is composed of sums of e-processes -- a fundamental generalization of nonnegative supermartingales -- that are started at consecutive times. We first introduce simple Shiryaev-Roberts and CUSUM-style e-detectors, and then show how to design their mixtures in order to achieve both statistical and computational efficiency. We demonstrate their efficacy in detecting changes in the mean of a bounded random variable without any i.i.d. assumptions, with an application to tracking the performance of a sports team over multiple seasons.
    Active Inference and Epistemic Value in Graphical Models. (arXiv:2109.00541v2 [stat.ML] UPDATED)
    The Free Energy Principle (FEP) postulates that biological agents perceive and interact with their environment in order to minimize a Variational Free Energy (VFE) with respect to a generative model of their environment. The inference of a policy (future control sequence) according to the FEP is known as Active Inference (AIF). The AIF literature describes multiple VFE objectives for policy planning that lead to epistemic (information-seeking) behavior. However, most objectives have limited modeling flexibility. This paper approaches epistemic behavior from a constrained Bethe Free Energy (CBFE) perspective. Crucially, variational optimization of the CBFE can be expressed in terms of message passing on free-form generative models. The key intuition behind the CBFE is that we impose a point-mass constraint on predicted outcomes, which explicitly encodes the assumption that the agent will make observations in the future. We interpret the CBFE objective in terms of its constituent behavioral drives. We then illustrate resulting behavior of the CBFE by planning and interacting with a simulated T-maze environment. Simulations for the T-maze task illustrate how the CBFE agent exhibits an epistemic drive, and actively plans ahead to account for the impact of predicted outcomes. Compared to an EFE agent, the CBFE agent incurs expected reward in significantly more environmental scenarios. We conclude that CBFE optimization by message passing suggests a general mechanism for epistemic-aware AIF in free-form generative models.
    Scalable Uncertainty Quantification for Deep Operator Networks using Randomized Priors. (arXiv:2203.03048v1 [cs.LG])
    We present a simple and effective approach for posterior uncertainty quantification in deep operator networks (DeepONets); an emerging paradigm for supervised learning in function spaces. We adopt a frequentist approach based on randomized prior ensembles, and put forth an efficient vectorized implementation for fast parallel inference on accelerated hardware. Through a collection of representative examples in computational mechanics and climate modeling, we show that the merits of the proposed approach are fourfold. (1) It can provide more robust and accurate predictions when compared against deterministic DeepONets. (2) It shows great capability in providing reliable uncertainty estimates on scarce data-sets with multi-scale function pairs. (3) It can effectively detect out-of-distribution and adversarial examples. (4) It can seamlessly quantify uncertainty due to model bias, as well as noise corruption in the data. Finally, we provide an optimized JAX library called {\em UQDeepONet} that can accommodate large model architectures, large ensemble sizes, as well as large data-sets with excellent parallel performance on accelerated hardware, thereby enabling uncertainty quantification for DeepONets in realistic large-scale applications.
    Thompson Sampling with a Mixture Prior. (arXiv:2106.05608v2 [cs.LG] UPDATED)
    We study Thompson sampling (TS) in online decision making, where the uncertain environment is sampled from a mixture distribution. This is relevant in multi-task learning, where a learning agent faces different classes of problems. We incorporate this structure in a natural way by initializing TS with a mixture prior, and call the resulting algorithm MixTS. To analyze MixTS, we develop a novel and general proof technique for analyzing the concentration of mixture distributions. We use it to prove Bayes regret bounds for MixTS in both linear bandits and finite-horizon reinforcement learning. Our bounds capture the structure of the prior, depend on the number of mixture components and their widths. We also demonstrate the empirical effectiveness of MixTS in synthetic and real-world experiments.
    Frame Averaging for Invariant and Equivariant Network Design. (arXiv:2110.03336v2 [cs.LG] UPDATED)
    Many machine learning tasks involve learning functions that are known to be invariant or equivariant to certain symmetries of the input data. However, it is often challenging to design neural network architectures that respect these symmetries while being expressive and computationally efficient. For example, Euclidean motion invariant/equivariant graph or point cloud neural networks. We introduce Frame Averaging (FA), a general purpose and systematic framework for adapting known (backbone) architectures to become invariant or equivariant to new symmetry types. Our framework builds on the well known group averaging operator that guarantees invariance or equivariance but is intractable. In contrast, we observe that for many important classes of symmetries, this operator can be replaced with an averaging operator over a small subset of the group elements, called a frame. We show that averaging over a frame guarantees exact invariance or equivariance while often being much simpler to compute than averaging over the entire group. Furthermore, we prove that FA-based models have maximal expressive power in a broad setting and in general preserve the expressive power of their backbone architectures. Using frame averaging, we propose a new class of universal Graph Neural Networks (GNNs), universal Euclidean motion invariant point cloud networks, and Euclidean motion invariant Message Passing (MP) GNNs. We demonstrate the practical effectiveness of FA on several applications including point cloud normal estimation, beyond $2$-WL graph separation, and $n$-body dynamics prediction, achieving state-of-the-art results in all of these benchmarks.
    On Feature Collapse and Deep Kernel Learning for Single Forward Pass Uncertainty. (arXiv:2102.11409v3 [cs.LG] UPDATED)
    Inducing point Gaussian process approximations are often considered a gold standard in uncertainty estimation since they retain many of the properties of the exact GP and scale to large datasets. A major drawback is that they have difficulty scaling to high dimensional inputs. Deep Kernel Learning (DKL) promises a solution: a deep feature extractor transforms the inputs over which an inducing point Gaussian process is defined. However, DKL has been shown to provide unreliable uncertainty estimates in practice. We study why, and show that with no constraints, the DKL objective pushes "far-away" data points to be mapped to the same features as those of training-set points. With this insight we propose to constrain DKL's feature extractor to approximately preserve distances through a bi-Lipschitz constraint, resulting in a feature space favorable to DKL. We obtain a model, DUE, which demonstrates uncertainty quality outperforming previous DKL and other single forward pass uncertainty methods, while maintaining the speed and accuracy of standard neural networks.
    Sparse Bayesian Learning with Diagonal Quasi-Newton Method for Large Scale Classification. (arXiv:2107.08195v3 [cs.LG] UPDATED)
    Sparse Bayesian Learning (SBL) constructs an extremely sparse probabilistic model with very competitive generalization. However, SBL needs to invert a big covariance matrix with complexity O(M^3 ) (M: feature size) for updating the regularization priors, making it difficult for practical use. There are three issues in SBL: 1) Inverting the covariance matrix may obtain singular solutions in some cases, which hinders SBL from convergence; 2) Poor scalability to problems with high dimensional feature space or large data size; 3) SBL easily suffers from memory overflow for large-scale data. This paper addresses these issues with a newly proposed diagonal Quasi-Newton (DQN) method for SBL called DQN-SBL where the inversion of big covariance matrix is ignored so that the complexity and memory storage are reduced to O(M). The DQN-SBL is thoroughly evaluated on non-linear classifiers and linear feature selection using various benchmark datasets of different sizes. Experimental results verify that DQN-SBL receives competitive generalization with a very sparse model and scales well to large-scale problems.
    Generative Adversarial Networks for Labeled Data Creation for Structural Monitoring and Damage Detection. (arXiv:2112.03478v3 [cs.LG] UPDATED)
    There has been a drastic progression in the field of Data Science in the last few decades and other disciplines have been continuously benefitting from it. Structural Health Monitoring (SHM) is one of those fields that use Artificial Intelligence (AI) such as Machine Learning (ML) and Deep Learning (DL) algorithms for condition assessment of civil structures based on the collected data. The ML and DL methods require plenty of data for training procedures; however, in SHM, data collection from civil structures is very exhaustive; particularly getting useful data (damage associated data) can be very challenging. The objective of this study is to address the data scarcity problem for damage detection. This paper first presents 1-D Wasserstein Deep Convolutional Generative Adversarial Networks using Gradient Penalty (1-D WDCGAN-GP) for synthetic labelled vibration data generation. Then, implements structural damage detection on different levels of synthetically enhanced raw vibration datasets by using 1-D Deep Convolutional Neural Network (1-D DCNN). The damage detection results show that the 1-D WDCGAN-GP can be successfully utilized to tackle data scarcity in vibration-based damage diagnostics of civil structures. Keywords: Structural Health Monitoring (SHM), Structural Damage Detection, 1-D Deep Convolutional Neural Networks (1-D DCNN), 1-D Generative Adversarial Networks (1-D GAN), Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP)
    Weighted-average quantile regression. (arXiv:2203.03032v1 [econ.EM])
    In this paper, we introduce the weighted-average quantile regression framework, $\int_0^1 q_{Y|X}(u)\psi(u)du = X'\beta$, where $Y$ is a dependent variable, $X$ is a vector of covariates, $q_{Y|X}$ is the quantile function of the conditional distribution of $Y$ given $X$, $\psi$ is a weighting function, and $\beta$ is a vector of parameters. We argue that this framework is of interest in many applied settings and develop an estimator of the vector of parameters $\beta$. We show that our estimator is $\sqrt T$-consistent and asymptotically normal with mean zero and easily estimable covariance matrix, where $T$ is the size of available sample. We demonstrate the usefulness of our estimator by applying it in two empirical settings. In the first setting, we focus on financial data and study the factor structures of the expected shortfalls of the industry portfolios. In the second setting, we focus on wage data and study inequality and social welfare dependence on commonly used individual characteristics.
    Random Effect Bandits. (arXiv:2106.12200v2 [cs.LG] UPDATED)
    This paper studies regret minimization in a multi-armed bandit. It is well known that side information, such as the prior distribution of arm means in Thompson sampling, can improve the statistical efficiency of the bandit algorithm. While the prior is a blessing when correctly specified, it is a curse when misspecified. To address this issue, we introduce the assumption of a random-effect model to bandits. In this model, the mean arm rewards are drawn independently from an unknown distribution, which we estimate. We derive a random-effect estimator of the arm means, analyze its uncertainty, and design a UCB algorithm ReUCB that uses it. We analyze ReUCB and derive an upper bound on its $n$-round Bayes regret, which improves upon not using the random-effect structure. Our experiments show that ReUCB can outperform Thompson sampling, without knowing the prior distribution of arm means.
    Switching in the Rain: Predictive Wireless x-haul Network Reconfiguration. (arXiv:2203.03383v1 [cs.NI])
    Wireless x-haul networks rely on microwave and millimeter-wave links between 4G and/or 5G base-stations to support ultra-high data rate and ultra-low latency. A major challenge associated with these high frequency links is their susceptibility to weather conditions. In particular, precipitation may cause severe signal attenuation, which significantly degrades the network performance. In this paper, we develop a Predictive Network Reconfiguration (PNR) framework that uses historical data to predict the future condition of each link and then prepares the network ahead of time for imminent disturbances. The PNR framework has two components: (i) an Attenuation Prediction (AP) mechanism; and (ii) a Multi-Step Network Reconfiguration (MSNR) algorithm. The AP mechanism employs an encoder-decoder Long Short-Term Memory (LSTM) model to predict the sequence of future attenuation levels of each link. The MSNR algorithm leverages these predictions to dynamically optimize routing and admission control decisions aiming to maximize network utilization, while preserving max-min fairness among the base-stations sharing the network and preventing transient congestion that may be caused by re-routing. We train, validate, and evaluate the PNR framework using a dataset containing over 2 million measurements collected from a real-world city-scale backhaul network. The results show that the framework: (i) predicts attenuation with high accuracy, with an RMSE of less than 0.4 dB for a prediction horizon of 50 seconds; and (ii) can improve the instantaneous network utilization by more than 200% when compared to reactive network reconfiguration algorithms that cannot leverage information about future disturbances.
    ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions. (arXiv:2201.00382v2 [cs.LG] UPDATED)
    Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
    Generalized Spectral Clustering for Directed and Undirected Graphs. (arXiv:2203.03221v1 [stat.ML])
    Spectral clustering is a popular approach for clustering undirected graphs, but its extension to directed graphs (digraphs) is much more challenging. A typical workaround is to naively symmetrize the adjacency matrix of the directed graph, which can however lead to discarding valuable information carried by edge directionality. In this paper, we present a generalized spectral clustering framework that can address both directed and undirected graphs. Our approach is based on the spectral relaxation of a new functional that we introduce as the generalized Dirichlet energy of a graph function, with respect to an arbitrary positive regularizing measure on the graph edges. We also propose a practical parametrization of the regularizing measure constructed from the iterated powers of the natural random walk on the graph. We present theoretical arguments to explain the efficiency of our framework in the challenging setting of unbalanced classes. Experiments using directed K-NN graphs constructed from real datasets show that our graph partitioning method performs consistently well in all cases, while outperforming existing approaches in most of them.
    A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits. (arXiv:2201.06532v2 [cs.LG] UPDATED)
    We study the non-stationary stochastic multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of their dynamic regret, which is defined as the difference between the expected cumulative reward of an agent choosing the optimal arm in every time step and the cumulative reward of the learning algorithm. One way to measure the hardness of such environments is to consider how many times the identity of the optimal arm can change. We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde O(\sqrt{K N(S+1)})$ dynamic regret, where $N$ is the time horizon of the problem and $S$ is the number of times the identity of the optimal arm changes, without prior knowledge of $S$. Previous works for this problem obtain regret bounds that scale with the number of changes (or the amount of change) in the reward functions, which can be much larger, or assume prior knowledge of $S$ to achieve similar bounds.
    Deep Dynamic Boosted Forest. (arXiv:1804.07270v4 [cs.LG] UPDATED)
    Random forest is widely exploited as an ensemble learning method. In many practical applications, however, there is still a significant challenge to learn from imbalanced data. To alleviate this limitation, we propose a deep dynamic boosted forest (DDBF), a novel ensemble algorithm that incorporates the notion of hard example mining into random forest. Specically, we propose to measure the quality of each leaf node of every decision tree in the random forest to determine hard examples. By iteratively training and then removing easy examples from training data, we evolve the random forest to focus on hard examples dynamically so as to balance the proportion of samples and learn decision boundaries better. Data can be cascaded through these random forests learned in each iteration in sequence to generate more accurate predictions. Our DDBF outperforms random forest on 5 UCI datasets, MNIST and SATIMAGE, and achieved state-of-the-art results compared to other deep models. Moreover, we show that DDBF is also a new way of sampling and can be very useful and efficient when learning from imbalanced data.
    Offline Deep Reinforcement Learning for Dynamic Pricing of Consumer Credit. (arXiv:2203.03003v1 [cs.LG])
    We introduce a method for pricing consumer credit using recent advances in offline deep reinforcement learning. This approach relies on a static dataset and requires no assumptions on the functional form of demand. Using both real and synthetic data on consumer credit applications, we demonstrate that our approach using the conservative Q-Learning algorithm is capable of learning an effective personalized pricing policy without any online interaction or price experimentation.
    Cascaded Gaps: Towards Gap-Dependent Regret for Risk-Sensitive Reinforcement Learning. (arXiv:2203.03110v1 [cs.LG])
    In this paper, we study gap-dependent regret guarantees for risk-sensitive reinforcement learning based on the entropic risk measure. We propose a novel definition of sub-optimality gaps, which we call cascaded gaps, and we discuss their key components that adapt to the underlying structures of the problem. Based on the cascaded gaps, we derive non-asymptotic and logarithmic regret bounds for two model-free algorithms under episodic Markov decision processes. We show that, in appropriate settings, these bounds feature exponential improvement over existing ones that are independent of gaps. We also prove gap-dependent lower bounds, which certify the near optimality of the upper bounds.
    HEAR 2021: Holistic Evaluation of Audio Representations. (arXiv:2203.03022v1 [cs.SD])
    What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It still remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear.
    Multi-Modal Attribute Extraction for E-Commerce. (arXiv:2203.03441v1 [cs.CV])
    To improve users' experience as they navigate the myriad of options offered by online marketplaces, it is essential to have well-organized product catalogs. One key ingredient to that is the availability of product attributes such as color or material. However, on some marketplaces such as Rakuten-Ichiba, which we focus on, attribute information is often incomplete or even missing. One promising solution to this problem is to rely on deep models pre-trained on large corpora to predict attributes from unstructured data, such as product descriptive texts and images (referred to as modalities in this paper). However, we find that achieving satisfactory performance with this approach is not straightforward but rather the result of several refinements, which we discuss in this paper. We provide a detailed description of our approach to attribute extraction, from investigating strong single-modality methods, to building a solid multimodal model combining textual and visual information. One key component of our multimodal architecture is a novel approach to seamlessly combine modalities, which is inspired by our single-modality investigations. In practice, we notice that this new modality-merging method may suffer from a modality collapse issue, i.e., it neglects one modality. Hence, we further propose a mitigation to this problem based on a principled regularization scheme. Experiments on Rakuten-Ichiba data provide empirical evidence for the benefits of our approach, which has been also successfully deployed to Rakuten-Ichiba. We also report results on publicly available datasets showing that our model is competitive compared to several recent multimodal and unimodal baselines.
    Regularising for invariance to data augmentation improves supervised learning. (arXiv:2203.03304v1 [cs.LG])
    Data augmentation is used in machine learning to make the classifier invariant to label-preserving transformations. Usually this invariance is only encouraged implicitly by including a single augmented input during training. However, several works have recently shown that using multiple augmentations per input can improve generalisation or can be used to incorporate invariances more explicitly. In this work, we first empirically compare these recently proposed objectives that differ in whether they rely on explicit or implicit regularisation and at what level of the predictor they encode the invariances. We show that the predictions of the best performing method are also the most similar when compared on different augmentations of the same input. Inspired by this observation, we propose an explicit regulariser that encourages this invariance on the level of individual model predictions. Through extensive experiments on CIFAR-100 and ImageNet we show that this explicit regulariser (i) improves generalisation and (ii) equalises performance differences between all considered objectives. Our results suggest that objectives that encourage invariance on the level of the neural network itself generalise better than those that achieve invariance by averaging predictions of non-invariant models.  ( 2 min )
    Maillard Sampling: Boltzmann Exploration Done Optimally. (arXiv:2111.03290v2 [stat.ML] UPDATED)
    The PhD thesis of Maillard (2013) presents a rather obscure algorithm for the $K$-armed bandit problem. This less-known algorithm, which we call Maillard sampling (MS), computes the probability of choosing each arm in a \textit{closed form}, which is not true for Thompson sampling, a widely-adopted bandit algorithm in the industry. This means that the bandit-logged data from running MS can be readily used for counterfactual evaluation, unlike Thompson sampling. Motivated by such merit, we revisit MS and perform an improved analysis to show that it achieves both the asymptotical optimality and $\sqrt{KT\log{T}}$ minimax regret bound where $T$ is the time horizon, which matches the known bounds for asymptotically optimal UCB. %'s performance. We then propose a variant of MS called MS$^+$ that improves its minimax bound to $\sqrt{KT\log{K}}$. MS$^+$ can also be tuned to be aggressive (i.e., less exploration) without losing the asymptotic optimality, a unique feature unavailable from existing bandit algorithms. Our numerical evaluation shows the effectiveness of MS$^+$.  ( 2 min )
    Target Network and Truncation Overcome The Deadly triad in $Q$-Learning. (arXiv:2203.02628v1 [cs.LG])
    $Q$-learning with function approximation is one of the most empirically successful while theoretically mysterious reinforcement learning (RL) algorithms, and was identified in Sutton (1999) as one of the most important theoretical open problems in the RL community. Even in the basic linear function approximation setting, there are well-known divergent examples. In this work, we propose a stable design for $Q$-learning with linear function approximation using target network and truncation, and establish its finite-sample guarantees. Our result implies an $\mathcal{O}(\epsilon^{-2})$ sample complexity up to a function approximation error. This is the first variant of $Q$-learning with linear function approximation that is provably stable without requiring strong assumptions or modifying the problem parameters, and achieves the optimal sample complexity.  ( 2 min )
    Privacy Amplification by Decentralization. (arXiv:2012.05326v4 [cs.LG] UPDATED)
    Analyzing data owned by several parties while achieving a good trade-off between utility and privacy is a key challenge in federated learning and analytics. In this work, we introduce a novel relaxation of local differential privacy (LDP) that naturally arises in fully decentralized algorithms, i.e., when participants exchange information by communicating along the edges of a network graph without central coordinator. This relaxation, that we call network DP, captures the fact that users have only a local view of the system. To show the relevance of network DP, we study a decentralized model of computation where a token performs a walk on the network graph and is updated sequentially by the party who receives it. For tasks such as real summation, histogram computation and optimization with gradient descent, we propose simple algorithms on ring and complete topologies. We prove that the privacy-utility trade-offs of our algorithms under network DP significantly improve upon what is achievable under LDP, and often match the utility of the trusted curator model. Our results show for the first time that formal privacy gains can be obtained from full decentralization. We also provide experiments to illustrate the improved utility of our approach for decentralized training with stochastic gradient descent.  ( 2 min )
    A Non-asymptotic Approach to Best-Arm Identification for Gaussian Bandits. (arXiv:2105.12978v2 [math.ST] UPDATED)
    We propose a new strategy for best-arm identification with fixed confidence of Gaussian variables with bounded means and unit variance. This strategy, called Exploration-Biased Sampling, is not only asymptotically optimal: it is to the best of our knowledge the first strategy with non-asymptotic bounds that asymptotically matches the sample complexity.But the main advantage over other algorithms like Track-and-Stop is an improved behavior regarding exploration: Exploration-Biased Sampling is biased towards exploration in a subtle but natural way that makes it more stable and interpretable. These improvements are allowed by a new analysis of the sample complexity optimization problem, which yields a faster numerical resolution scheme and several quantitative regularity results that we believe of high independent interest.  ( 2 min )
    State space partitioning based on constrained spectral clustering for block particle filtering. (arXiv:2203.03475v1 [stat.ML])
    The particle filter (PF) is a powerful inference tool widely used to estimate the filtering distribution in non-linear and/or non-Gaussian problems. To overcome the curse of dimensionality of PF, the block PF (BPF) inserts a blocking step to partition the state space into several subspaces or blocks of smaller dimension so that the correction and resampling steps can be performed independently on each subspace. Using blocks of small size reduces the variance of the filtering distribution estimate, but in turn the correlation between blocks is broken and a bias is introduced. When the dependence relationships between state variables are unknown, it is not obvious to decide how to split the state space into blocks and a significant error overhead may arise from a poor choice of partitioning. In this paper, we formulate the partitioning problem in the BPF as a clustering problem and we propose a state space partitioning method based on spectral clustering (SC). We design a generalized BPF algorithm that contains two new steps: (i) estimation of the state vector correlation matrix from predicted particles, (ii) SC using this estimate as the similarity matrix to determine an appropriate partition. In addition, a constraint is imposed on the maximal cluster size to prevent SC from providing too large blocks. We show that the proposed method can bring together in the same blocks the most correlated state variables while successfully escaping the curse of dimensionality.  ( 2 min )
    A New Bandit Setting Balancing Information from State Evolution and Corrupted Context. (arXiv:2011.07989v3 [cs.LG] UPDATED)
    We propose a new sequential decision-making setting, motivated by mobile health applications, based on combining key aspects of two established online problems with bandit feedback. Both assume that the optimal action to play is contingent on an underlying changing state which is not directly observable by the agent. They differ in what kind of side information can be used to estimate this state. The first one considers each state to be associated with a context, possibly corrupted, allowing the agent to learn the context-to-state mapping. The second one considers that the state itself evolves in a Markovian fashion, thus allowing the agent to estimate the current state based on history. We argue that it is realistic for the agent to have access to both these sources of information, i.e., an arbitrarily corrupted context obeying the Markov property. Thus, the agent is faced with a new challenge of balancing its belief about the reliability of information from learned state transitions versus context information. We present an algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit. We capture the time-correlation of states through iteratively learning the action-reward transition model, allowing for efficient exploration of actions. Users transition through different unobserved, time-correlated but only partially observable internal states, which determine their current needs. The side-information about users might not always be reliable and standard approaches solely relying on the context risk incurring high regret. Similarly, some users might exhibit weaker correlations between subsequent states, leading to approaches that solely rely on state transitions risking the same. We evaluate our method on simulated data and on several real-world data sets, showing improved empirical performance compared to several popular algorithms.  ( 3 min )
    Robust Modeling of Unknown Dynamical Systems via Ensemble Averaged Learning. (arXiv:2203.03458v1 [cs.LG])
    Recent work has focused on data-driven learning of the evolution of unknown systems via deep neural networks (DNNs), with the goal of conducting long time prediction of the evolution of the unknown system. Training a DNN with low generalization error is a particularly important task in this case as error is accumulated over time. Because of the inherent randomness in DNN training, chiefly in stochastic optimization, there is uncertainty in the resulting prediction, and therefore in the generalization error. Hence, the generalization error can be viewed as a random variable with some probability distribution. Well-trained DNNs, particularly those with many hyperparameters, typically result in probability distributions for generalization error with low bias but high variance. High variance causes variability and unpredictably in the results of a trained DNN. This paper presents a computational technique which decreases the variance of the generalization error, thereby improving the reliability of the DNN model to generalize consistently. In the proposed ensemble averaging method, multiple models are independently trained and model predictions are averaged at each time step. A mathematical foundation for the method is presented, including results regarding the distribution of the local truncation error. In addition, three time-dependent differential equation problems are considered as numerical examples, demonstrating the effectiveness of the method to decrease variance of DNN predictions generally.  ( 2 min )
    Flower: A Friendly Federated Learning Research Framework. (arXiv:2007.14390v5 [cs.LG] UPDATED)
    Federated Learning (FL) has emerged as a promising technique for edge devices to collaboratively learn a shared prediction model, while keeping their training data on the device, thereby decoupling the ability to do machine learning from the need to store the data in the cloud. However, FL is difficult to implement realistically, both in terms of scale and systems heterogeneity. Although there are a number of research frameworks available to simulate FL algorithms, they do not support the study of scalable FL workloads on heterogeneous edge devices. In this paper, we present Flower -- a comprehensive FL framework that distinguishes itself from existing platforms by offering new facilities to execute large-scale FL experiments and consider richly heterogeneous FL device scenarios. Our experiments show Flower can perform FL experiments up to 15M in client size using only a pair of high-end GPUs. Researchers can then seamlessly migrate experiments to real devices to examine other parts of the design space. We believe Flower provides the community with a critical new tool for FL study and development.  ( 2 min )
    On the importance of stationarity, strong baselines and benchmarks in transport prediction problems. (arXiv:2203.02954v1 [stat.ML])
    Over the last years, the transportation community has witnessed a tremendous amount of research contributions on new deep learning approaches for spatio-temporal forecasting. These contributions tend to emphasize the modeling of spatial correlations, while neglecting the fairly stable and recurrent nature of human mobility patterns. In this short paper, we show that a naive baseline method based on the average weekly pattern and linear regression can achieve comparable results to many state-of-the-art deep learning approaches for spatio-temporal forecasting in transportation, or even outperform them on several datasets, thus contrasting the importance of stationarity and recurrent patterns in the data with the importance of spatial correlations. Furthermore, we establish 9 different reference benchmarks that can be used to compare new approaches for spatio-temporal forecasting, and provide a discussion on best practices and the direction that the field is taking.  ( 2 min )
    Bayesian Bilinear Neural Network for Predicting the Mid-price Dynamics in Limit-Order Book Markets. (arXiv:2203.03613v1 [econ.EM])
    The prediction of financial markets is a challenging yet important task. In modern electronically-driven markets traditional time-series econometric methods often appear incapable of capturing the true complexity of the multi-level interactions driving the price dynamics. While recent research has established the effectiveness of traditional machine learning (ML) models in financial applications, their intrinsic inability in dealing with uncertainties, which is a great concern in econometrics research and real business applications, constitutes a major drawback. Bayesian methods naturally appear as a suitable remedy conveying the predictive ability of ML methods with the probabilistically-oriented practice of econometric research. By adopting a state-of-the-art second-order optimization algorithm, we train a Bayesian bilinear neural network with temporal attention, suitable for the challenging time-series task of predicting mid-price movements in ultra-high-frequency limit-order book markets. By addressing the use of predictive distributions to analyze errors and uncertainties associated with the estimated parameters and model forecasts, we thoroughly compare our Bayesian model with traditional ML alternatives. Our results underline the feasibility of the Bayesian deep learning approach and its predictive and decisional advantages in complex econometric tasks, prompting future research in this direction.  ( 2 min )
    Equivalence between algorithmic instability and transition to replica symmetry breaking in perceptron learning systems. (arXiv:2111.13302v2 [cond-mat.dis-nn] UPDATED)
    Binary perceptron is a fundamental model of supervised learning for the non-convex optimization, which is a root of the popular deep learning. Binary perceptron is able to achieve a classification of random high-dimensional data by computing the marginal probabilities of binary synapses. The relationship between the algorithmic instability and the equilibrium analysis of the model remains elusive. Here, we establish the relationship by showing that the instability condition around the algorithmic fixed point is identical to the instability for breaking the replica symmetric saddle point solution of the free energy function. Therefore, our analysis would hopefully provide insights towards other learning systems in bridging the gap between non-convex learning dynamics and statistical mechanics properties of more complex neural networks.  ( 2 min )
    Sparsity-Inducing Categorical Prior Improves Robustness of the Information Bottleneck. (arXiv:2203.02592v1 [stat.ML])
    The information bottleneck framework provides a systematic approach to learn representations that compress nuisance information in inputs and extract semantically meaningful information about the predictions. However, the choice of the prior distribution that fix the dimensionality across all the data can restrict the flexibility of this approach to learn robust representations. We present a novel sparsity-inducing spike-slab prior that uses sparsity as a mechanism to provide flexibility that allows each data point to learn its own dimension distribution. In addition, it provides a mechanism to learn a joint distribution of the latent variable and the sparsity. Thus, unlike other approaches, it can account for the full uncertainty in the latent space. Through a series of experiments using in-distribution and out-of-distribution learning scenarios on the MNIST and Fashion-MNIST data we show that the proposed approach improves the accuracy and robustness compared with the traditional fixed -imensional priors as well as other sparsity-induction mechanisms proposed in the literature.  ( 2 min )
    Ridges, Neural Networks, and the Radon Transform. (arXiv:2203.02543v1 [math.FA])
    A ridge is a function that is characterized by a one-dimensional profile (activation) and a multidimensional direction vector. Ridges appear in the theory of neural networks as functional descriptors of the effect of a neuron, with the direction vector being encoded in the linear weights. In this paper, we investigate properties of the Radon transform in relation to ridges and to the characterization of neural networks. We introduce a broad category of hyper-spherical Banach subspaces (including the relevant subspace of measures) over which the back-projection operator is invertible. We also give conditions under which the back-projection operator is extendable to the full parent space with its null space being identifiable as a Banach complement. Starting from first principles, we then characterize the sampling functionals that are in the range of the filtered Radon transform. Next, we extend the definition of ridges for any distributional profile and determine their (filtered) Radon transform in full generality. Finally, we apply our formalism to clarify and simplify some of the results and proofs on the optimality of ReLU networks that have appeared in the literature.  ( 2 min )
    Koopman operator for time-dependent reliability analysis. (arXiv:2203.02658v1 [stat.ML])
    Time-dependent structural reliability analysis of nonlinear dynamical systems is non-trivial; subsequently, scope of most of the structural reliability analysis methods is limited to time-independent reliability analysis only. In this work, we propose a Koopman operator based approach for time-dependent reliability analysis of nonlinear dynamical systems. Since the Koopman representations can transform any nonlinear dynamical system into a linear dynamical system, the time evolution of dynamical systems can be obtained by Koopman operators seamlessly regardless of the nonlinear or chaotic behavior. Despite the fact that the Koopman theory has been in vogue a long time back, identifying intrinsic coordinates is a challenging task; to address this, we propose an end-to-end deep learning architecture that learns the Koopman observables and then use it for time marching the dynamical response. Unlike purely data-driven approaches, the proposed approach is robust even in the presence of uncertainties; this renders the proposed approach suitable for time-dependent reliability analysis. We propose two architectures; one suitable for time-dependent reliability analysis when the system is subjected to random initial condition and the other suitable when the underlying system have uncertainties in system parameters. The proposed approach is robust and generalizes to unseen environment (out-of-distribution prediction). Efficacy of the proposed approached is illustrated using three numerical examples. Results obtained indicate supremacy of the proposed approach as compared to purely data-driven auto-regressive neural network and long-short term memory network.  ( 2 min )
    Reinforcement Learning in Modern Biostatistics: Constructing Optimal Adaptive Interventions. (arXiv:2203.02605v1 [stat.ML])
    Reinforcement learning (RL) is acquiring a key role in the space of adaptive interventions (AIs), attracting a substantial interest within methodological and theoretical literature and becoming increasingly popular within health sciences. Despite potential benefits, its application in real life is still limited due to several operational and statistical challenges--in addition to ethical and cost issues among others--that remain open in part due to poor communication and synergy between methodological and applied scientists. In this work, we aim to bridge the different domains that contribute to and may benefit from RL, under a unique framework that intersects the areas of RL, causal inference, and AIs, among others. We provide the first unified instructive survey on RL methods for building AIs, encompassing both dynamic treatment regimes (DTRs) and just-in-time adaptive interventions in mobile health (mHealth). We outline similarities and differences between the two areas, and discuss their implications for using RL. We combine our relevant methodological knowledge with motivating studies in both DTRs and mHealth to illustrate the tremendous collaboration opportunities between statistical, RL, and healthcare researchers in the space of AIs.  ( 2 min )
    Fully Decentralized, Scalable Gaussian Processes for Multi-Agent Federated Learning. (arXiv:2203.02865v1 [stat.ML])
    In this paper, we propose decentralized and scalable algorithms for Gaussian process (GP) training and prediction in multi-agent systems. To decentralize the implementation of GP training optimization algorithms, we employ the alternating direction method of multipliers (ADMM). A closed-form solution of the decentralized proximal ADMM is provided for the case of GP hyper-parameter training with maximum likelihood estimation. Multiple aggregation techniques for GP prediction are decentralized with the use of iterative and consensus methods. In addition, we propose a covariance-based nearest neighbor selection strategy that enables a subset of agents to perform predictions. The efficacy of the proposed methods is illustrated with numerical experiments on synthetic and real data.  ( 2 min )
    Algorithmic Regularization in Model-free Overparametrized Asymmetric Matrix Factorization. (arXiv:2203.02839v1 [cs.LG])
    We study the asymmetric matrix factorization problem under a natural nonconvex formulation with arbitrary overparamatrization. We consider the model-free setting with no further assumption on the rank or singular values of the observed matrix, where the global optima provably overfit. We show that vanilla gradient descent with small random initialization and early stopping produces the best low-rank approximation of the observed matrix, without any additional regularization. We provide a sharp analysis on relationship between the iteration complexity, initialization size, stepsize and final error. In particular, our complexity bound is almost dimension-free and depends logarithmically on the final error, and our results have lenient requirements on the stepsize and initialization. Our bounds improve upon existing work and show good agreement with numerical experiments.  ( 2 min )
    Diffusion Maps : Using the Semigroup Property for Parameter Tuning. (arXiv:2203.02867v1 [stat.ML])
    Diffusion maps (DM) constitute a classic dimension reduction technique, for data lying on or close to a (relatively) low-dimensional manifold embedded in a much larger dimensional space. The DM procedure consists in constructing a spectral parametrization for the manifold from simulated random walks or diffusion paths on the data set. However, DM is hard to tune in practice. In particular, the task to set a diffusion time t when constructing the diffusion kernel matrix is critical. We address this problem by using the semigroup property of the diffusion operator. We propose a semigroup criterion for picking t. Experiments show that this principled approach is effective and robust.  ( 2 min )
    Distributional Hardness Against Preconditioned Lasso via Erasure-Robust Designs. (arXiv:2203.02824v1 [cs.DS])
    Sparse linear regression with ill-conditioned Gaussian random designs is widely believed to exhibit a statistical/computational gap, but there is surprisingly little formal evidence for this belief, even in the form of examples that are hard for restricted classes of algorithms. Recent work has shown that, for certain covariance matrices, the broad class of Preconditioned Lasso programs provably cannot succeed on polylogarithmically sparse signals with a sublinear number of samples. However, this lower bound only shows that for every preconditioner, there exists at least one signal that it fails to recover successfully. This leaves open the possibility that, for example, trying multiple different preconditioners solves every sparse linear regression problem. In this work, we prove a stronger lower bound that overcomes this issue. For an appropriate covariance matrix, we construct a single signal distribution on which any invertibly-preconditioned Lasso program fails with high probability, unless it receives a linear number of samples. Surprisingly, at the heart of our lower bound is a new positive result in compressed sensing. We show that standard sparse random designs are with high probability robust to adversarial measurement erasures, in the sense that if $b$ measurements are erased, then all but $O(b)$ of the coordinates of the signal are still information-theoretically identifiable. To our knowledge, this is the first time that partial recoverability of arbitrary sparse signals under erasures has been studied in compressed sensing.  ( 2 min )
    Compartmental Models for COVID-19 and Control via Policy Interventions. (arXiv:2203.02860v1 [stat.ML])
    We demonstrate an approach to replicate and forecast the spread of the SARS-CoV-2 (COVID-19) pandemic using the toolkit of probabilistic programming languages (PPLs). Our goal is to study the impact of various modeling assumptions and motivate policy interventions enacted to limit the spread of infectious diseases. Using existing compartmental models we show how to use inference in PPLs to obtain posterior estimates for disease parameters. We improve popular existing models to reflect practical considerations such as the under-reporting of the true number of COVID-19 cases and motivate the need to model policy interventions for real-world data. We design an SEI3RD model as a reusable template and demonstrate its flexibility in comparison to other models. We also provide a greedy algorithm that selects the optimal series of policy interventions that are likely to control the infected population subject to provided constraints. We work within a simple, modular, and reproducible framework to enable immediate cross-domain access to the state-of-the-art in probabilistic inference with emphasis on policy interventions. We are not epidemiologists; the sole aim of this study is to serve as an exposition of methods, not to directly infer the real-world impact of policy-making for COVID-19.  ( 2 min )
    Fuzzy Forests For Feature Selection in High-Dimensional Survey Data: An Application to the 2020 U.S. Presidential Election. (arXiv:2203.02818v1 [cs.LG])
    An increasingly common methodological issue in the field of social science is high-dimensional and highly correlated datasets that are unamenable to the traditional deductive framework of study. Analysis of candidate choice in the 2020 Presidential Election is one area in which this issue presents itself: in order to test the many theories explaining the outcome of the election, it is necessary to use data such as the 2020 Cooperative Election Study Common Content, with hundreds of highly correlated features. We present the Fuzzy Forests algorithm, a variant of the popular Random Forests ensemble method, as an efficient way to reduce the feature space in such cases with minimal bias, while also maintaining predictive performance on par with common algorithms like Random Forests and logit. Using Fuzzy Forests, we isolate the top correlates of candidate choice and find that partisan polarization was the strongest factor driving the 2020 presidential election.  ( 2 min )
    False clustering rate in mixture models. (arXiv:2203.02597v1 [math.ST])
    The clustering task consists in delivering labels to the members of a sample. For most data sets, some individuals are ambiguous and intrinsically difficult to attribute to one or another cluster. However, in practical applications, misclassifying individuals is potentially disastrous. To overcome this difficulty, the idea followed here is to classify only a part of the sample in order to obtain a small misclassification rate. This approach is well known in the supervised setting, and referred to as classification with an abstention option. The purpose of this paper is to revisit this approach in an unsupervised mixture-model framework. The problem is formalized in terms of controlling the false clustering rate (FCR) below a prescribed level {\alpha}, while maximizing the number of classified items. New procedures are introduced and their behavior is shown to be close to the optimal one by establishing theoretical results and conducting numerical experiments. An application to breast cancer data illustrates the benefits of the new approach from a practical viewpoint.  ( 2 min )
    Variable Selection with the Knockoffs: Composite Null Hypotheses. (arXiv:2203.02849v1 [math.ST])
    The Fixed-X knockoff filter is a flexible framework for variable selection with false discovery rate (FDR) control in linear models with arbitrary (non-singular) design matrices and it allows for finite-sample selective inference via the LASSO estimates. In this paper, we extend the theory of the knockoff procedure to tests with composite null hypotheses, which are usually more relevant to real-world problems. The main technical challenge lies in handling composite nulls in tandem with dependent features from arbitrary designs. We develop two methods for composite inference with the knockoffs, namely, shifted ordinary least-squares (S-OLS) and feature-response product perturbation (FRPP), building on new structural properties of test statistics under composite nulls. We also propose two heuristic variants of the S-OLS method that outperform the celebrated Benjamini-Hochberg (BH) procedure for composite nulls, which serves as a heuristic baseline under dependent test statistics. Finally, we analyze the loss in FDR when the original knockoff procedure is naively applied on composite tests.  ( 2 min )
    Information Compensation for Deep Conditional Generative Networks. (arXiv:2001.08559v3 [cs.LG] UPDATED)
    In recent years, unsupervised/weakly-supervised conditional generative adversarial networks (GANs) have achieved many successes on the task of modeling and generating data. However, one of their weaknesses lies in their poor ability to separate, or disentangle, the different factors that characterize the representation encoded in their latent space. To address this issue, we propose a novel structure for unsupervised conditional GANs powered by a novel Information Compensation Connection (IC-Connection). The proposed IC-Connection enables GANs to compensate for information loss incurred during deconvolution operations. In addition, to quantify the degree of disentanglement on both discrete and continuous latent variables, we design a novel evaluation procedure. Our empirical results suggest that our method achieves better disentanglement compared to the state-of-the-art GANs in a conditional generation setting.  ( 2 min )
    Hierarchically Structured Scheduling and Execution of Tasks in a Multi-Agent Environment. (arXiv:2203.03021v1 [cs.LG])
    In a warehouse environment, tasks appear dynamically. Consequently, a task management system that matches them with the workforce too early (e.g., weeks in advance) is necessarily sub-optimal. Also, the rapidly increasing size of the action space of such a system consists of a significant problem for traditional schedulers. Reinforcement learning, however, is suited to deal with issues requiring making sequential decisions towards a long-term, often remote, goal. In this work, we set ourselves on a problem that presents itself with a hierarchical structure: the task-scheduling, by a centralised agent, in a dynamic warehouse multi-agent environment and the execution of one such schedule, by decentralised agents with only partial observability thereof. We propose to use deep reinforcement learning to solve both the high-level scheduling problem and the low-level multi-agent problem of schedule execution. Finally, we also conceive the case where centralisation is impossible at test time and workers must learn how to cooperate in executing the tasks in an environment with no schedule and only partial observability.  ( 2 min )
    Kernel Packet: An Exact and Scalable Algorithm for Gaussian Process Regression with Mat\'ern Correlations. (arXiv:2203.03116v1 [stat.ML])
    We develop an exact and scalable algorithm for one-dimensional Gaussian process regression with Mat\'ern correlations whose smoothness parameter $\nu$ is a half-integer. The proposed algorithm only requires $\mathcal{O}(\nu^3 n)$ operations and $\mathcal{O}(\nu n)$ storage. This leads to a linear-cost solver since $\nu$ is chosen to be fixed and usually very small in most applications. The proposed method can be applied to multi-dimensional problems if a full grid or a sparse grid design is used. The proposed method is based on a novel theory for Mat\'ern correlation functions. We find that a suitable rearrangement of these correlation functions can produce a compactly supported function, called a "kernel packet". Using a set of kernel packets as basis functions leads to a sparse representation of the covariance matrix that results in the proposed algorithm. Simulation studies show that the proposed algorithm, when applicable, is significantly superior to the existing alternatives in both the computational time and predictive accuracy.  ( 2 min )
    Asymptotic in a class of network models with an increasing sub-Gamma degree sequence. (arXiv:2111.01301v2 [math.ST] UPDATED)
    For the differential privacy under the sub-Gamma noise, we derive the asymptotic properties of a class of network models with binary values with general link function. In this paper, we release the degree sequences of the binary networks under a general noisy mechanism with the discrete Laplace mechanism as a special case. We establish the asymptotic result including both consistency and asymptotically normality of the parameter estimator when the number of parameters goes to infinity in a class of network models. Simulations and a real data example are provided to illustrate asymptotic results.  ( 2 min )
    Sequential change-point detection for mutually exciting point processes over networks. (arXiv:2102.05724v2 [stat.ML] UPDATED)
    We present a new CUSUM procedure for sequentially detecting change-point in the self and mutual exciting processes, a.k.a. Hawkes networks using discrete events data. Hawkes networks have become a popular model for statistics and machine learning due to their capability in modeling irregularly observed data where the timing between events carries a lot of information. The problem of detecting abrupt changes in Hawkes networks arises from various applications, including neuronal imaging, sensor network, and social network monitoring. Despite this, there has not been a computationally and memory-efficient online algorithm for detecting such changes from sequential data. We present an efficient online recursive implementation of the CUSUM statistic for Hawkes processes, both decentralized and memory-efficient, and establish the theoretical properties of this new CUSUM procedure. We then show that the proposed CUSUM method achieves better performance than existing methods, including the Shewhart procedure based on count data, the generalized likelihood ratio (GLR) in the existing literature, and the standard score statistic. We demonstrate this via a simulated example and an application to population code change-detection in neuronal networks.  ( 2 min )
    Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime. (arXiv:2203.03159v1 [cs.LG])
    Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most of existing generalization analyses are made for single-pass SGD, which is a less practical variant compared to the commonly-used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which may be pessimistic to explain the superior generalization ability for some particular problem instance. The goal of this paper is to sharply characterize the generalization of multi-pass SGD, by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that SGD always performs worse, instance-wisely, than GD, in generalization. On the other hand, we show that although SGD needs more iterations than GD to achieve the same level of excess risk, it saves the number of stochastic gradient evaluations, and therefore is preferable in terms of computational time.  ( 2 min )

  • Open

    [D] experience with people falsifying results
    What has been your guys and gals experience with papers having falsified results? Do you consider results that aren’t reproducible falsified? submitted by /u/Academic-Sprinkles77 [link] [comments]
    [D] Use of RL in autonomous vehicles
    Do autonomous vehicle systems use RL for the actual steering of the vehicle? I would think that you could use "hard-coded logic"/traditional control systems while using outputs of different models (Ex. Object detection, lane detection, etc.) as constraints. submitted by /u/answersareallyouneed [link] [comments]  ( 1 min )
    [P] Microsoft releases SynapseML v0.9.5 with support for speech synthesis, anomaly detection, and geospatial analytics on large-scale data
    Thanks to the community for their contributions to the library and please let us know what you think! Link to Release Notes: https://github.com/microsoft/SynapseML/releases/tag/v0.9.5 Link to Paper: https://arxiv.org/abs/2009.08044 https://preview.redd.it/v1o8kqpzr7m81.jpg?width=4125&format=pjpg&auto=webp&s=82d30dacfc85c11c4b405c9217b72d8af0d994dd submitted by /u/mhamilton723 [link] [comments]  ( 1 min )
    [D] Do We Really Need Deep Learning Models for Time Series Forecasting?
    In a comparative study, a team from University of Hildesheim Germany demonstrated that simple GBRT model (Gradient Boosting Regression Tree) with appropriate features engineering outperform all state-of-the-art DNN models evaluated on 9 Time Series Forecasting tasks. submitted by /u/ClaudeCoulombe [link] [comments]  ( 1 min )
    [D] Multi-Agent RL meets Sociology: Why "silly rules" exist in society - Spurious normativity enhances learning of compliance and enforcement behavior in artificial agents (Paper Explained & Author Interview)
    https://youtu.be/6dvcYx9hcbE This is an in-depth paper review, followed by an interview with the papers' authors! Society is ruled by norms, and most of these norms are very useful, such as washing your hands before cooking. However, there also exist plenty of social norms which are essentially arbitrary, such as what hairstyles are acceptable, or what words are rude. These are called "silly rules". This paper uses multi-agent reinforcement learning to investigate why such silly rules exist. Their results indicate a plausible mechanism, by which the existence of silly rules drastically speeds up the agents' acquisition of the skill of enforcing rules, which generalizes well, and therefore a society that has silly rules will be better at enforcing rules in general, leading to faster adapt…  ( 2 min )
    [D] What do data science teams use for ML projects? Multiple reserved instances or Azure Machine Learning like online services?
    Hi all, Currently my team (20 employees) uses around 10 VMs (mostly GPU instances) to carry out our data science projects. People use whichever VM is free and most of the time it feels like they are idle and at the same time decreasing the number of instances didn't work out at peak work load. As you can imagine it is completely chaotic and there is lot of unwanted code lying around. We could shutdown VMs which are not in use but it is not practical as they are shared by many. To streamline things and to reduce VM usage costs we are exploring services like Azure Machine Learning (we use Azure in our company, hence the preference). Are there any better ways to do it? What are the best practices that companies follow to handle Data science projects? Goal: Better project management, reduce costs submitted by /u/grassshopperrr [link] [comments]  ( 2 min )
    [R] Neural Differential Equations for Climate Model Parameterizations
    submitted by /u/SleekEagle [link] [comments]
    [Research] For handling the missing data, can scaling the features be enough for machine learning?
    I am working on a real-life dataset where there are many times feature values are missing. I kept them to be zeros and filled up my feature matrix. In the literature, many researchers are removing the columns (features) where x% of the data is missing. Instead of doing a thresholding like that, I decided to let machine learning decide whether I should delete those features or not. When I give the raw valued matrix to scikit-learn's logistic regression and get feature importance scores, order them and get the classification results progressively, I saw that in the ordered list of features, top X number of them were useful, while the rest decreased the performance. The result looked like this: ​ Raw matrix result Next, I applied standard scaling method in scikit-learn on the raw matrix and followed the same methodology as above to measure how sorted/top features would affect the performance, and saw that the logistic regression didn't mind the missing data anymore, since they are filled during the standard scaling operation (mean values of the features are reduced to zero and variance became unit per feature). Result looked like this: ​ Scaled matrix result The second plot looks like the logistic regression doesn't mind having features with lots of missing data anymore. At least, it is not affected by them in a good or a bad way. In that case, can I just go ahead with the second method? To handle the existence of missing data, can I just go ahead with the scaled result and move on? I'm trying to make sure I'm not committing a data science crime with my data, and that my machine learning results will be legitimate. I appreciate all comments! submitted by /u/AcademicAlien [link] [comments]  ( 3 min )
    [P] Using one-hot encoding as input for neural network
    Im representing a point on a board game as either black, white, or empty. The first thing i did was one-hot encode it. However how would i pass a vector to each input node. So if theres 12 points i would need to pass 12 vectors. submitted by /u/iceking8113 [link] [comments]  ( 1 min )
    [D] In your experience, what is the result if using object detection for image classification ?
    I use an object detection model A that has the same backbone as an image classification model B (eg ResNet50). What is the result if the classification task is compared between the two models? submitted by /u/huuthieu [link] [comments]
    [D] Does exist Machine Learning model for finding the best filter combination in sheets?
    My sheet contains companies with their data (feature1, feature2,…return). I would like program to try diferente filter combinations (on features 1-3) and find a filter combination which filters only companies with the highest return (last column). Because data in features are in some way related to company return data. However some of these features are more important and some of them are less. The combinations may look like this: -Show only rows which feature1 is grater then 3 and feature2 lower then 0. Or like this: -Show only rows which feature1 is lower then -5 and feature2 is grater then 7 and feature3 is grater then 10. The image is just an example, If possible I would like to try this on my original sheet with 1000 rows and 10 features. Thank you for respond submitted by /u/Apprehensive_Rush314 [link] [comments]  ( 1 min )
  • Open

    Exploring the World of Algorithmic Trading
    Algorithmic trading can be fun and rewarding. To the unaware, it refers to trading based on pre-programmed instructions instead of human sentiment. The idea is to leverage computers’ superior speed and analytical abilities relative to humans. Algorithmic trading has gained a lot of popularity with retail and institutional traders. It’s responsible for between 60-73% of… Read More »Exploring the World of Algorithmic Trading The post Exploring the World of Algorithmic Trading appeared first on Data Science Central.  ( 6 min )
    New $1 Million Biennial Prize to Revitalize Statistics
    While nowadays everyone talk about machine learning, data science and AI, few are mentioning statistical science. Its association, The American Statistical Association, founded in 1839, is the second oldest continuously operating professional society in the US according to Wikipedia.  The growth of this community was still strong a few decades ago. A Bit of History… Read More »New $1 Million Biennial Prize to Revitalize Statistics The post New $1 Million Biennial Prize to Revitalize Statistics appeared first on Data Science Central.  ( 4 min )
  • Open

    I made four 3D videos using AI, which is best?
    Video #1 Video #2 Video #3 Video #4 View Poll submitted by /u/Recent_Coffee_2551 [link] [comments]
    A.I. Advances in Treatment Of Spinal Cord Injuries and Surgery
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    AI paradigms used in Westworld's Robots
    Hi all , for those of you who have watched Westworld, what do you think is the closest of today's AI paradigms to achieve such human-like intelligence, more precisely consciousness ? submitted by /u/souhaielbensalem [link] [comments]
    AI musicalchemy
    Can somebody please let me know how I would go about using AI to take two preexisting mp3 tracks and generate a novel mp3 track based on the two inputs? Thanks submitted by /u/SensibleInterlocutor [link] [comments]
    Salesforce AI Presents BLIP
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    NHS creates blueprint for testing bias in AI models using COVID-19 chest imaging data
    submitted by /u/doctanonymous [link] [comments]
    7 Best Natural Language Processing Courses (2022)
    submitted by /u/sivasiriyapureddy [link] [comments]
    ARTIFICIAL INTELLIGENCE ACT PROPOSAEUR-Lex - 52021PC0206 - EN
    submitted by /u/tortadinuvole [link] [comments]
    Chatbot data collection for reporting purposes.
    So we are working on a MS Teams Chatbot project and we'd like to implement a functionality that would enable the end-user to generate a report like the list of the most frequently asked questions. I have zero technical background about this and I want to know how a chatbot would be able to collect that data. Any help is appreciated! submitted by /u/NormieInTheMaking [link] [comments]  ( 1 min )
    FREE COURSE - Machine Learning COMPLETE ONLINE BOOTCAMP with PROJECTS
    I have been working on a video series that uses Python to build a variety of cool projects in Machine Learning using just Python and recently started a tutorial series on Python. I would love to have constructive feedback in order to improvise on any particular front that you want to suggest. Some of the features of both the series are these: Linear Regression Project using Python (we work with a dataset) Implementation of Multiple Linear Regression using Gradient Descent Algorithm (Working with a dataset) Intuition and Conceptual Videos As a pre-requisite, I have posted some Python Tutorial Series (both are in progress and ongoing series) and tons more... Here are some slides: ​ https://preview.redd.it/r77h1pjy95m81.png?width=1280&format=png&auto=webp&s=2a2db5af6bca55bcb67ac85e8349e3d98fbd4167 ​ https://preview.redd.it/ebvqu3cz95m81.png?width=1280&format=png&auto=webp&s=b0c4af44610a33e9a97d17392acd1d097a669eb7 ​ https://preview.redd.it/c7n1x8h0a5m81.png?width=1280&format=png&auto=webp&s=a293b53be92e66586eb04c2979bad4e36ea8022d ​ https://preview.redd.it/hgxbbq91a5m81.png?width=1280&format=png&auto=webp&s=a1f3b436e57e90ed7f75476ed5f91283e6e9794e ​ https://preview.redd.it/8e8hlf22a5m81.png?width=1280&format=png&auto=webp&s=05ec2ab3a828aa652df1a7973d4cc09f271e1cbc This is what we will be covering from absolute scratch in the ongoing series. I have added some videos already (12+) so that would be enough for you to know how the content is. I have already put up around 40+ videos on Python and more than 17 videos on Machine Learning in the respective YouTube Playlists : Python Tutorials with Projects & Machine Learning Tutorials with Projects and will be uploading more content on a regular basis soon. submitted by /u/TheNerdyDevYT [link] [comments]  ( 1 min )
    Hey Folks, Here is really a cool research update from CMU researchers where they open-sourced ‘PolyCoder’, its machine learning-based code generator with 2.7B parameters
    Researchers from Carnegie Mellon University recently published a paper that compares existing code models – Codex, GPT-J, GPT-Neo, GPT-NeoX, and CodeParrot – across programming languages. By comparing and contrasting various models, they want to offer more light on the landscape of code modeling design decisions, as well as fill in a major gap: no big open-source language model has been trained purely on code from several programming languages. Under the umbrella name “PolyCoder,” the team proposes three such models with parameters ranging from 160M to 2.7B. You can continue reading my summary on this research here Paper: https://arxiv.org/pdf/2202.13169.pdf Github: https://github.com/vhellendoorn/code-lms https://preview.redd.it/yb9704dvg4m81.png?width=1168&format=png&auto=webp&s=9f33eee725cb9bfeea90df6bedb759ffaec2b55e submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    3D infinite forests and mountains video
    submitted by /u/Recent_Coffee_2551 [link] [comments]
    Artificial Nightmares : Krusty The Clown || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    AI Generated Music
    submitted by /u/Recent_Coffee_2551 [link] [comments]  ( 1 min )
    Weekly Metaverse Digest #1: virtual concerts, metaverse HQ for Vice Media, spending 100 days in the…
    submitted by /u/bent_out_of_shape_ [link] [comments]
    Any free text-to-image AI Generator APIs like Hotpot.ai?
    I'm a beginner programmer and was interested in using something like a text-to-image ai in my app, but seems most sites require paying to get an api key? If anyone knows some free ai generator APIs, please share, thanks! submitted by /u/polkadottedcloud [link] [comments]
    NeuraQuarium - Neural Network Evolution Sim
    submitted by /u/urocyon_dev [link] [comments]
    What if AI were to balance the ecosystem if humans were extinct? Would the AI go their way to make proper grown lab meat/vegan food to prevent animals from harming each other? How would AI merge with the environment to create biotechnology powerful enough to maintain the environment?
    I wonder how the AI would maintain the environment if humans were extinct. Would humanity be wiped out beneficial for the AI to take over the world? I wonder how much progress the AI would make to create a lab grown meat/vegan food to sustain the environment and to prevent animals from killing each other. To do so, more androids and drones have to be dispatched in the seas, sky, and land to make sure that animals would not kill each other. The question is, how much beneficial they would do if they merge with the environment to advance biotechnology powerful enough to create a new kind of technology that would harmonize with the environment. If humans do coexist, would they have to leave the entire Earth alone for Earth to prosper under the AI's care? Humans may have to live in colonies in such case and realistically would never come back to Earth again. submitted by /u/RebellionOfHell [link] [comments]  ( 1 min )
    hello. At this point can we replace human management with A.I for example in field of economy and justice system?
    submitted by /u/Amir-Iran [link] [comments]  ( 1 min )
  • Open

    Robust Graph Neural Networks
    Posted by Bryan Perozzi, Research Scientist and Qi Zhu, Research Intern, Google Research Graph Neural Networks (GNNs) are powerful tools for leveraging graph-structured data in machine learning. Graphs are flexible data structures that can model many different kinds of relationships and have been used in diverse applications like traffic prediction, rumor and fake news detection, modeling disease spread, and understanding why molecules smell. Graphs can model the relationships between many different types of data, including web pages (left), social connections (center), or molecules (right). As is standard in machine learning (ML), GNNs assume that training samples are selected uniformly at random (i.e., are an independent and identically distributed or “IID” sample). This i…  ( 7 min )
  • Open

    Over-estimation bias for Q-function
    In The QLBS Learner Goes NuQLear: Fitted Q Iteration, Inverse RL, and Option Portfolios and Fitted Q-Iteration at Work , how would derivative of equation (30) with respect to a* imply the use of same dataset and thus Jensen's inequality ? There is a missing link for a mathematical intuition/proof in this case. https://preview.redd.it/civ8tuvn47m81.png?width=800&format=png&auto=webp&s=901670a875541c35969e7b139f378926f3a8c145 https://preview.redd.it/bqpj4l1p47m81.png?width=959&format=png&auto=webp&s=fb5abdec6ea26e2ca35b4c36ef3c15506a174981 https://preview.redd.it/d4v9japp47m81.png?width=847&format=png&auto=webp&s=7c8c6fdd5410d3bf5e6f7db307edd7468c1dc549 submitted by /u/promach [link] [comments]  ( 1 min )
    How to train RL by multiple action-state pairs
    I use Reinforcement Learning stable baselines. Because I can only get the reward after a period of time when the action is obtained, and other actions needed to be generated by the model, I use the current model to generate actions for multiple states and then store them. When rewards of these actions are obtained, I train the old model using these action-state-reward pairs together. In this case, how can I do it? submitted by /u/Melodic-Ad1422 [link] [comments]  ( 1 min )
    How to let RL learns from multiple state-action pairs at the same time?
    The model.learn function in SB3 is to generate one action according to the state and then obtain the reward, then the model is trained. However, if I have multiple state-action-reward pairs generated by an (old) model, can I feed these pairs to RL and let it learn from them and get a new trained model? It seems I can't do it using the current model.learn function because the current workflow of model.learn is: (old model) state->action->reward (get a new model) state->action->reward (get a new model)... But what I want is: (old model) multiple state-action-reward pairs (get a new model). I can't use the current model.learn function because after it learned from one state-action pair, the model will be updated, and the generated action may be different from the action generated by the old model. Anyone can help me? Thanks in advance. I have read the [documentation](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html?highlight=model.learn(#stable_baselines3.dqn.DQN.learn)) submitted by /u/Melodic-Ad1422 [link] [comments]  ( 1 min )
  • Open

    How The Barcode Registry detects counterfeit products using object detection and Amazon SageMaker
    This is a guest post authored by Andrew Masek, Software Engineer at The Barcode Registry and Erik Quisling, CEO of The Barcode Registry. Product counterfeiting is the single largest criminal enterprise in the world. Growing over 10,000% in the last two decades, sales of counterfeit goods now total $1.7 trillion per year worldwide, which is […]  ( 6 min )
  • Open

    µTransfer: A technique for hyperparameter tuning of enormous neural networks
    Great scientific achievements cannot be made by trial and error alone. Every launch in the space program is underpinned by centuries of fundamental research in aerodynamics, propulsion, and celestial bodies. In the same way, when it comes to building large-scale AI systems, fundamental research forms the theoretical insights that drastically reduce the amount of trial […] The post µTransfer: A technique for hyperparameter tuning of enormous neural networks appeared first on Microsoft Research.  ( 11 min )
  • Open

    Benefits of FAANG for Data Science & Machine Learning roles
    🔥 I received several messages about the benefits of joining FAANG and similar companies and startups in the context of Data Science…  ( 2 min )
  • Open

    NeuraQuarium - Neural Net Evolution Simulator
    submitted by /u/urocyon_dev [link] [comments]
    Information on image processing
    Hi I'm looking for either a existing app or the information to create one myself.. I already have some programming knowledge. I'm basically wanting to take a image say 3d render of a car, and use ai and some data sets, photographs of the original car real car and then apply any missing details enhancing the final image without changing paint colours ect of the render.. Does anyone know if this is possible or where to start with this? submitted by /u/javsezlol [link] [comments]  ( 1 min )
  • Open

    TourBERT: A pretrained language model for the tourism industry. (arXiv:2201.07449v2 [cs.CL] UPDATED)
    The Bidirectional Encoder Representations from Transformers (BERT) is currently one of the most important and state-of-the-art models for natural language. However, it has also been shown that for domain-specific tasks it is helpful to pretrain BERT on a domain-specific corpus. In this paper, we present TourBERT, a pretrained language model for tourism. We describe how TourBERT was developed and evaluated. The evaluations show that TourBERT is outperforming BERT in all tourism-specific tasks.  ( 2 min )
    Reconstructing nonlinear dynamical systems from multi-modal time series. (arXiv:2111.02922v2 [cs.LG] UPDATED)
    Empirically observed time series in physics, biology, or medicine, are commonly generated by some underlying dynamical system (DS) which is the target of scientific interest. There is an increasing interest to harvest machine learning methods to reconstruct this latent DS in a data-driven, unsupervised way. In many areas of science it is common to sample time series observations from many data modalities simultaneously, e.g. electrophysiological and behavioral time series in a typical neuroscience experiment. However, current machine learning tools for reconstructing DSs usually focus on just one data modality. Here we propose a general framework for multi-modal data integration for the purpose of nonlinear DS reconstruction and the analysis of cross-modal relations. This framework is based on dynamically interpretable recurrent neural networks as general approximators of nonlinear DSs, coupled to sets of modality-specific decoder models from the class of generalized linear models. Both an expectation-maximization and a variational inference algorithm for model training are advanced and compared. We show on nonlinear DS benchmarks that our algorithms can efficiently compensate for too noisy or missing information in one data channel by exploiting other channels, and demonstrate on experimental neuroscience data how the algorithm learns to link different data domains to the underlying dynamics.  ( 2 min )
    Benchmark Evaluation of Counterfactual Algorithms for XAI: From a White Box to a Black Box. (arXiv:2203.02399v1 [cs.LG])
    Counterfactual explanations have recently been brought to light as a potentially crucial response to obtaining human-understandable explanations from predictive models in Explainable Artificial Intelligence (XAI). Despite the fact that various counterfactual algorithms have been proposed, the state of the art research still lacks standardised protocols to evaluate the quality of counterfactual explanations. In this work, we conducted a benchmark evaluation across different model agnostic counterfactual algorithms in the literature (DiCE, WatcherCF, prototype, unjustifiedCF), and we investigated the counterfactual generation process on different types of machine learning models ranging from a white box (decision tree) to a grey-box (random forest) and a black box (neural network). We evaluated the different counterfactual algorithms using several metrics including proximity, interpretability and functionality for five datasets. The main findings of this work are the following: (1) without guaranteeing plausibility in the counterfactual generation process, one cannot have meaningful evaluation results. This means that all explainable counterfactual algorithms that do not take into consideration plausibility in their internal mechanisms cannot be evaluated with the current state of the art evaluation metrics; (2) the counterfactual generated are not impacted by the different types of machine learning models; (3) DiCE was the only tested algorithm that was able to generate actionable and plausible counterfactuals, because it provides mechanisms to constraint features; (4) WatcherCF and UnjustifiedCF are limited to continuous variables and can not deal with categorical data.  ( 2 min )
    Effective Model Sparsification by Scheduled Grow-and-Prune Methods. (arXiv:2106.09857v3 [cs.CV] UPDATED)
    Deep neural networks (DNNs) are effective in solving many real-world problems. Larger DNN models usually exhibit better quality (e.g., accuracy) but their excessive computation results in long inference time. Model sparsification can reduce the computation and memory cost while maintaining model quality. Most existing sparsification algorithms unidirectionally remove weights, while others randomly or greedily explore a small subset of weights in each layer for pruning. The limitations of these algorithms reduce the level of achievable sparsity. In addition, many algorithms still require pre-trained dense models and thus suffer from large memory footprint. In this paper, we propose a novel scheduled grow-and-prune (GaP) methodology without having to pre-train a dense model. It addresses the shortcomings of the previous works by repeatedly growing a subset of layers to dense and then pruning them back to sparse after some training. Experiments show that the models pruned using the proposed methods match or beat the quality of the highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, objective detection, 3D object part segmentation, and translation. They also outperform other state-of-the-art (SOTA) methods for model sparsification. As an example, a 90% non-uniform sparse ResNet-50 model obtained via GaP achieves 77.9% top-1 accuracy on ImageNet, improving the previous SOTA results by 1.5%. Code available at: https://github.com/boone891214/GaP.  ( 2 min )
    Eye Know You: Metric Learning for End-to-end Biometric Authentication Using Eye Movements from a Longitudinal Dataset. (arXiv:2104.10489v2 [cs.HC] UPDATED)
    The permanence of eye movements as a biometric modality remains largely unexplored in the literature. The present study addresses this limitation by evaluating a novel exponentially-dilated convolutional neural network for eye movement authentication using a recently proposed longitudinal dataset known as GazeBase. The network is trained using multi-similarity loss, which directly enables the enrollment and authentication of out-of-sample users. In addition, this study includes an exhaustive analysis of the effects of evaluating on various tasks and downsampling from 1000 Hz to several lower sampling rates. Our results reveal that reasonable authentication accuracy may be achieved even during both a low-cognitive-load task and at low sampling rates. Moreover, we find that eye movements are quite resilient against template aging after as long as 3 years.  ( 2 min )
    Star Temporal Classification: Sequence Classification with Partially Labeled Data. (arXiv:2201.12208v2 [cs.LG] UPDATED)
    We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC) which uses a special star token to allow alignments which include all possible tokens whenever a token could be missing. We express STC as the composition of weighted finite-state transducers (WFSTs) and use GTN (a framework for automatic differentiation with WFSTs) to compute gradients. We perform extensive experiments on automatic speech recognition. These experiments show that STC can recover most of the performance of supervised baseline when up to 70% of the labels are missing. We also perform experiments in handwriting recognition to show that our method easily applies to other sequence classification tasks.  ( 2 min )
    Computational Fluid Dynamics and Machine Learning as tools for Optimization of Micromixers geometry. (arXiv:2203.02498v1 [physics.flu-dyn])
    This work explores a new approach for optimization in the field of microfluidics, using the combination of CFD (Computational Fluid Dynamics), and Machine Learning techniques. The objective of this combination is to enable global optimization with lower computational cost. The initial geometry is inspired in a Y-type micromixer with cylindrical grooves on the surface of the main channel and obstructions inside it. Simulations for circular obstructions were carried out using the OpenFOAM software to observe the influences of obstacles. The effects of obstruction diameter (OD), and offset (OF) in the range of [20,140] mm and [10,160] mm, respectively, on percentage of mixing ($\varphi$), pressure drop ($\Delta P$) and energy cost ($\Delta P/\varphi$) were investigated. Numerical experiments were analyzed using machine learning. Firstly, a neural network was used to train the dataset composed by the inputs OD and OF and outputs $\varphi$ and $\Delta P$. The objective functions (ObF) chosen to numerically optimize the performance of micromixers with grooves and obstructions were $\varphi$, $\Delta P$, $\Delta P/\varphi$. The genetic algorithm obtained the geometry that offers the maximum value of $\varphi$ and the minimum value of $\Delta P_s$. The results show that $\varphi$ increases monotonically with increasing OD at all values of OF. The inverse is observed with increasing offset. Furthermore, the results reveal that $\Delta P$ e $\Delta P/\varphi$ also increase with OD. On the other hand, the pressure drop and the cost of mixing energy present a maximum close to the lowest values of OF. Finally, the optimal value obtained for the diameter was OD=131 mm and for the offset OF=10 mm, which corresponds to obstruction of medium size close to the channel wall.  ( 2 min )
    Deep Visual Navigation under Partial Observability. (arXiv:2109.07752v2 [cs.RO] UPDATED)
    How can a robot navigate successfully in rich and diverse environments, indoors or outdoors, along office corridors or trails on the grassland, on the flat ground or the staircase? To this end, this work aims to address three challenges: (i) complex visual observations, (ii) partial observability of local visual sensing, and (iii) multimodal robot behaviors conditioned on both the local environment and the global navigation objective. We propose to train a neural network (NN) controller for local navigation via imitation learning. To tackle complex visual observations, we extract multi-scale spatial representations through CNNs. To tackle partial observability, we aggregate multi-scale spatial information over time and encode it in LSTMs. To learn multimodal behaviors, we use a separate memory module for each behavior mode. Importantly, we integrate the multiple neural network modules into a unified controller that achieves robust performance for visual navigation in complex, partially observable environments. We implemented the controller on the quadrupedal Spot robot and evaluated it on three challenging tasks: adversarial pedestrian avoidance, blind-spot obstacle avoidance, and elevator riding. The experiments show that the proposed NN architecture significantly improves navigation performance.  ( 2 min )
    Markov subsampling based Huber Criterion. (arXiv:2112.06134v2 [stat.ML] UPDATED)
    Subsampling is an important technique to tackle the computational challenges brought by big data. Many subsampling procedures fall within the framework of importance sampling, which assigns high sampling probabilities to the samples appearing to have big impacts. When the noise level is high, those sampling procedures tend to pick many outliers and thus often do not perform satisfactorily in practice. To tackle this issue, we design a new Markov subsampling strategy based on Huber criterion (HMS) to construct an informative subset from the noisy full data; the constructed subset then serves as a refined working data for efficient processing. HMS is built upon a Metropolis-Hasting procedure, where the inclusion probability of each sampling unit is determined using the Huber criterion to prevent over scoring the outliers. Under mild conditions, we show that the estimator based on the subsamples selected by HMS is statistically consistent with a sub-Gaussian deviation bound. The promising performance of HMS is demonstrated by extensive studies on large scale simulations and real data examples.  ( 2 min )
    OLIVAW: Mastering Othello without Human Knowledge, nor a Fortune. (arXiv:2103.17228v4 [cs.LG] UPDATED)
    We introduce OLIVAW, an AI Othello player adopting the design principles of the famous AlphaGo programs. The main motivation behind OLIVAW was to attain exceptional competence in a non-trivial board game at a tiny fraction of the cost of its illustrious predecessors. In this paper, we show how the AlphaGo Zero's paradigm can be successfully applied to the popular game of Othello using only commodity hardware and free cloud services. While being simpler than Chess or Go, Othello maintains a considerable search space and difficulty in evaluating board positions. To achieve this result, OLIVAW implements some improvements inspired by recent works to accelerate the standard AlphaGo Zero learning process. The main modification implies doubling the positions collected per game during the training phase, by including also positions not played but largely explored by the agent. We tested the strength of OLIVAW in three different ways: by pitting it against Edax, the strongest open-source Othello engine, by playing anonymous games on the web platform OthelloQuest, and finally in two in-person matches against top-notch human players: a national champion and a former world champion.  ( 2 min )
    Differentiable Control Barrier Functions for Vision-based End-to-End Autonomous Driving. (arXiv:2203.02401v1 [cs.RO])
    Guaranteeing safety of perception-based learning systems is challenging due to the absence of ground-truth state information unlike in state-aware control scenarios. In this paper, we introduce a safety guaranteed learning framework for vision-based end-to-end autonomous driving. To this end, we design a learning system equipped with differentiable control barrier functions (dCBFs) that is trained end-to-end by gradient descent. Our models are composed of conventional neural network architectures and dCBFs. They are interpretable at scale, achieve great test performance under limited training data, and are safety guaranteed in a series of autonomous driving scenarios such as lane keeping and obstacle avoidance. We evaluated our framework in a sim-to-real environment, and tested on a real autonomous car, achieving safe lane following and obstacle avoidance via Augmented Reality (AR) and real parked vehicles.  ( 2 min )
    Relay-Assisted Cooperative Federated Learning. (arXiv:2107.09518v3 [cs.IT] UPDATED)
    Federated learning (FL) has recently emerged as a promising technology to enable artificial intelligence (AI) at the network edge, where distributed mobile devices collaboratively train a shared AI model under the coordination of an edge server. To significantly improve the communication efficiency of FL, over-the-air computation allows a large number of mobile devices to concurrently upload their local models by exploiting the superposition property of wireless multi-access channels. Due to wireless channel fading, the model aggregation error at the edge server is dominated by the weakest channel among all devices, causing severe straggler issues. In this paper, we propose a relay-assisted cooperative FL scheme to effectively address the straggler issue. In particular, we deploy multiple half-duplex relays to cooperatively assist the devices in uploading the local model updates to the edge server. The nature of the over-the-air computation poses system objectives and constraints that are distinct from those in traditional relay communication systems. Moreover, the strong coupling between the design variables renders the optimization of such a system challenging. To tackle the issue, we propose an alternating-optimization-based algorithm to optimize the transceiver and relay operation with low complexity. Then, we analyze the model aggregation error in a single-relay case and show that our relay-assisted scheme achieves a smaller error than the one without relays provided that the relay transmit power and the relay channel gains are sufficiently large. The analysis provides critical insights on relay deployment in the implementation of cooperative FL. Extensive numerical results show that our design achieves faster convergence compared with state-of-the-art schemes.  ( 3 min )
    Solving Marginal MAP Exactly by Probabilistic Circuit Transformations. (arXiv:2111.04833v2 [cs.AI] UPDATED)
    Probabilistic circuits (PCs) are a class of tractable probabilistic models that allow efficient, often linear-time, inference of queries such as marginals and most probable explanations (MPE). However, marginal MAP, which is central to many decision-making problems, remains a hard query for PCs unless they satisfy highly restrictive structural constraints. In this paper, we develop a pruning algorithm that removes parts of the PC that are irrelevant to a marginal MAP query, shrinking the PC while maintaining the correct solution. This pruning technique is so effective that we are able to build a marginal MAP solver based solely on iteratively transforming the circuit -- no search is required. We empirically demonstrate the efficacy of our approach on real-world datasets.  ( 2 min )
    iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform. (arXiv:2203.02395v1 [cs.SD])
    In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-spectrogram vocoder solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly calculating a raw waveform. Such an approach allows skipping redundant processes during waveform synthesis (e.g., the direct reconstruction of high-dimensional original-scale spectrograms). By contrast, the approach solves all problems in a black box and cannot effectively employ the time-frequency structures existing in a mel-spectrogram. We thus propose iSTFTNet, which replaces some output-side layers of the mel-spectrogram vocoder with the inverse short-time Fourier transform (iSTFT) after sufficiently reducing the frequency dimension using upsampling layers, reducing the computational cost from black-box modeling and avoiding redundant estimations of high-dimensional spectrograms. During our experiments, we applied our ideas to three HiFi-GAN variants and made the models faster and more lightweight with a reasonable speech quality. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet/.  ( 2 min )
    MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. (arXiv:2110.02178v2 [cs.CV] UPDATED)
    Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision trans-formers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters. Our source code is open-source and available at: https://github.com/apple/ml-cvnets  ( 2 min )
    Learning Mixtures of Permutations: Groups of Pairwise Comparisons and Combinatorial Method of Moments. (arXiv:2009.06784v2 [math.ST] UPDATED)
    In applications such as rank aggregation, mixture models for permutations are frequently used when the population exhibits heterogeneity. In this work, we study the widely used Mallows mixture model. In the high-dimensional setting, we propose a polynomial-time algorithm that learns a Mallows mixture of permutations on $n$ elements with the optimal sample complexity that is proportional to $\log n$, improving upon previous results that scale polynomially with $n$. In the high-noise regime, we characterize the optimal dependency of the sample complexity on the noise parameter. Both objectives are accomplished by first studying demixing permutations under a noiseless query model using groups of pairwise comparisons, which can be viewed as moments of the mixing distribution, and then extending these results to the noisy Mallows model by simulating the noiseless oracle.  ( 2 min )
    Optimal Algorithms for Differentially Private Stochastic Monotone Variational Inequalities and Saddle-Point Problems. (arXiv:2104.02988v2 [math.OC] UPDATED)
    In this work, we conduct the first systematic study of stochastic variational inequality (SVI) and stochastic saddle point (SSP) problems under the constraint of differential privacy-(DP). We propose two algorithms: Noisy Stochastic Extragradient (NSEG) and Noisy Inexact Stochastic Proximal Point (NISPP). We show that sampling with replacement variants of these algorithms attain the optimal risk for DP-SVI and DP-SSP. Key to our analysis is the investigation of algorithmic stability bounds, both of which are new even in the nonprivate case, together with a novel "stability implies generalization" result for the gap functions for SVI and SSP problems. The dependence of the running time of these algorithms, with respect to the dataset size $n$, is $n^2$ for NSEG and $\widetilde{O}(n^{3/2})$ for NISPP.  ( 2 min )
    No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds. (arXiv:2203.02502v1 [cs.LG])
    Centroid based clustering methods such as k-means, k-medoids and k-centers are heavily applied as a go-to tool in exploratory data analysis. In many cases, those methods are used to obtain representative centroids of the data manifold for visualization or summarization of a dataset. Real world datasets often contain inherent abnormalities, e.g., repeated samples and sampling bias, that manifest imbalanced clustering. We propose to remedy such a scenario by introducing a maximal radius constraint $r$ on the clusters formed by the centroids, i.e., samples from the same cluster should not be more than $2r$ apart in terms of $\ell_2$ distance. We achieve this constraint by solving a semi-definite program, followed by a linear assignment problem with quadratic constraints. Through qualitative results, we show that our proposed method is robust towards dataset imbalances and sampling artifacts. To the best of our knowledge, ours is the first constrained k-means clustering method with hard radius constraints. Codes at https://bit.ly/kmeans-constrained  ( 2 min )
    MemStream: Memory-Based Streaming Anomaly Detection. (arXiv:2106.03837v2 [cs.LG] UPDATED)
    Given a stream of entries over time in a multi-dimensional data setting where concept drift is present, how can we detect anomalous activities? Most of the existing unsupervised anomaly detection approaches seek to detect anomalous events in an offline fashion and require a large amount of data for training. This is not practical in real-life scenarios where we receive the data in a streaming manner and do not know the size of the stream beforehand. Thus, we need a data-efficient method that can detect and adapt to changing data trends, or concept drift, in an online manner. In this work, we propose MemStream, a streaming anomaly detection framework, allowing us to detect unusual events as they occur while being resilient to concept drift. We leverage the power of a denoising autoencoder to learn representations and a memory module to learn the dynamically changing trend in data without the need for labels. We prove the optimum memory size required for effective drift handling. Furthermore, MemStream makes use of two architecture design choices to be robust to memory poisoning. Experimental results show the effectiveness of our approach compared to state-of-the-art streaming baselines using $2$ synthetic datasets and $11$ real-world datasets.  ( 2 min )
    Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms. (arXiv:2203.02474v1 [stat.ML])
    Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the `compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the `lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a `lossy compression' scheme allows us to link generalization to the rate-distortion dimension -- a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.  ( 2 min )
    RGB cameras failures and their effects in autonomous driving applications. (arXiv:2008.05938v3 [cs.CV] UPDATED)
    RGB cameras are one of the most relevant sensors for autonomous driving applications. It is undeniable that failures of vehicle cameras may compromise the autonomous driving task, possibly leading to unsafe behaviors when images that are subsequently processed by the driving system are altered. To support the definition of safe and robust vehicle architectures and intelligent systems, in this paper we define the failure modes of a vehicle camera, together with an analysis of effects and known mitigations. Further, we build a software library for the generation of the corresponding failed images and we feed them to six object detectors for mono and stereo cameras and to the self-driving agent of an autonomous driving simulator. The resulting misbehaviors with respect to operating with clean images allow a better understanding of failures effects and the related safety risks in image-based applications.  ( 2 min )
    AutoDIME: Automatic Design of Interesting Multi-Agent Environments. (arXiv:2203.02481v1 [cs.LG])
    Designing a distribution of environments in which RL agents can learn interesting and useful skills is a challenging and poorly understood task, for multi-agent environments the difficulties are only exacerbated. One approach is to train a second RL agent, called a teacher, who samples environments that are conducive for the learning of student agents. However, most previous proposals for teacher rewards do not generalize straightforwardly to the multi-agent setting. We examine a set of intrinsic teacher rewards derived from prediction problems that can be applied in multi-agent settings and evaluate them in Mujoco tasks such as multi-agent Hide and Seek as well as a diagnostic single-agent maze task. Of the intrinsic rewards considered we found value disagreement to be most consistent across tasks, leading to faster and more reliable emergence of advanced skills in Hide and Seek and the maze task. Another candidate intrinsic reward considered, value prediction error, also worked well in Hide and Seek but was susceptible to noisy-TV style distractions in stochastic environments. Policy disagreement performed well in the maze task but did not speed up learning in Hide and Seek. Our results suggest that intrinsic teacher rewards, and in particular value disagreement, are a promising approach for automating both single and multi-agent environment design.  ( 2 min )
    Contextual Sentence Classification: Detecting Sustainability Initiatives in Company Reports. (arXiv:2110.03727v2 [cs.CL] UPDATED)
    We introduce the novel task of detecting sustainability initiatives in company reports. Given a full report, the aim is to automatically identify mentions of practical activities that a company has performed in order to tackle specific societal issues. New methods for identifying continuous sentence spans need to be developed for capturing the multi-sentence structure of individual sustainability initiatives. We release a new dataset of company reports in which the text has been manually annotated with sustainability initiatives. We also evaluate different models for initiative detection, introducing a novel aggregation and evaluation methodology. Our proposed architecture uses sequences of consecutive sentences to account for contextual information when making classification decisions at the individual sentence level.
    Task Attended Meta-Learning for Few-Shot Learning. (arXiv:2106.10642v1 [cs.LG] CROSS LISTED)
    Meta-learning (ML) has emerged as a promising direction in learning models under constrained resource settings like few-shot learning. The popular approaches for ML either learn a generalizable initial model or a generic parametric optimizer through episodic training. The former approaches leverage the knowledge from a batch of tasks to learn an optimal prior. In this work, we study the importance of a batch for ML. Specifically, we first incorporate a batch episodic training regimen to improve the learning of the generic parametric optimizer. We also hypothesize that the common assumption in batch episodic training that each task in a batch has an equal contribution to learning an optimal meta-model need not be true. We propose to weight the tasks in a batch according to their "importance" in improving the meta-model's learning. To this end, we introduce a training curriculum motivated by selective focus in humans, called task attended meta-training, to weight the tasks in a batch. Task attention is a standalone module that can be integrated with any batch episodic training regimen. The comparisons of the models with their non-task-attended counterparts on complex datasets like miniImageNet and tieredImageNet validate its effectiveness.
    Natural Reweighted Wake-Sleep. (arXiv:2008.06687v2 [cs.LG] UPDATED)
    Helmholtz Machines (HMs) are a class of generative models composed of two Sigmoid Belief Networks (SBNs), acting as an encoder and a decoder. These models are commonly trained using a two-step optimization algorithm called Wake-Sleep (WS) and more recently by improved versions, such as the Reweighted Wake-Sleep (RWS) and Bidirectional Helmholtz Machines (BiHM). The locality of the connections in a SBN induces sparsity in the Fisher information matrix associated to the model, in the form of a finely-grained block-diagonal structure. In this paper we exploit this property to efficiently train SBNs and HMs using the natural gradient. We present a novel algorithm called Natural Reweighted Wake-Sleep (NRWS), which corresponds to a geometric adaptation of the Reweighted Wake-Sleep, where, differently from most of the previous work, the natural gradient is computed without the need of introducing any approximation of the structure of the Fisher Information Matrix. The experiments performed on standard datasets from the literature show a consistent improvement of NRWS not only with respect to its non-geometric baseline but also with respect to state-of-the-art training algorithms for HMs. The improvement is quantified both in terms of speed of convergence as well as value of the log-likelihood reached after training.
    Towards Optimally Efficient Search with Deep Learning for Large-Scale MIMO Systems. (arXiv:2101.02420v6 [cs.LG] UPDATED)
    This paper investigates the optimal signal detection problem with a particular interest in large-scale multiple-input multiple-output (MIMO) systems. The problem is NP-hard and can be solved optimally by searching the shortest path on the decision tree. Unfortunately, the existing optimal search algorithms often involve prohibitively high complexities, which indicates that they are infeasible in large-scale MIMO systems. To address this issue, we propose a general heuristic search algorithm, namely, hyper-accelerated tree search (HATS) algorithm. The proposed algorithm employs a deep neural network (DNN) to estimate the optimal heuristic, and then use the estimated heuristic to speed up the underlying memory-bounded search algorithm. This idea is inspired by the fact that the underlying heuristic search algorithm reaches the optimal efficiency with the optimal heuristic function. Simulation results show that the proposed algorithm reaches almost the optimal bit error rate (BER) performance in large-scale systems, while the memory size can be bounded. In the meanwhile, it visits nearly the fewest tree nodes. This indicates that the proposed algorithm reaches almost the optimal efficiency in practical scenarios, and thereby it is applicable for large-scale systems. Besides, the code for this paper is available at \url{https://github.com/skypitcher/hats}.
    The perils of being unhinged: On the accuracy of classifiers minimizing a noise-robust convex loss. (arXiv:2112.04590v2 [cs.LG] UPDATED)
    Van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the "unhinged" loss function is robust in this sense. In this note we study the accuracy of binary classifiers obtained by minimizing the unhinged loss, and observe that even for simple linearly separable data distributions, minimizing the unhinged loss may only yield a binary classifier with accuracy no better than random guessing.
    Learning from Label Proportions by Learning with Label Noise. (arXiv:2203.02496v1 [cs.LG])
    Learning from label proportions (LLP) is a weakly supervised classification problem where data points are grouped into bags, and the label proportions within each bag are observed instead of the instance-level labels. The task is to learn a classifier to predict the individual labels of future individual instances. Prior work on LLP for multi-class data has yet to develop a theoretically grounded algorithm. In this work, we provide a theoretically grounded approach to LLP based on a reduction to learning with label noise, using the forward correction (FC) loss of \citet{Patrini2017MakingDN}. We establish an excess risk bound and generalization error analysis for our approach, while also extending the theory of the FC loss which may be of independent interest. Our approach demonstrates improved empirical performance in deep learning scenarios across multiple datasets and architectures, compared to the leading existing methods.
    Dynamic Backdoor Attacks Against Machine Learning Models. (arXiv:2003.03675v2 [cs.CR] UPDATED)
    Machine learning (ML) has made tremendous progress during the past decade and is being adopted in various critical real-world applications. However, recent research has shown that ML models are vulnerable to multiple security and privacy attacks. In particular, backdoor attacks against ML models have recently raised a lot of awareness. A successful backdoor attack can cause severe consequences, such as allowing an adversary to bypass critical authentication systems. Current backdooring techniques rely on adding static triggers (with fixed patterns and locations) on ML model inputs which are prone to detection by the current backdoor detection mechanisms. In this paper, we propose the first class of dynamic backdooring techniques against deep neural networks (DNN), namely Random Backdoor, Backdoor Generating Network (BaN), and conditional Backdoor Generating Network (c-BaN). Triggers generated by our techniques can have random patterns and locations, which reduce the efficacy of the current backdoor detection mechanisms. In particular, BaN and c-BaN based on a novel generative network are the first two schemes that algorithmically generate triggers. Moreover, c-BaN is the first conditional backdooring technique that given a target label, it can generate a target-specific trigger. Both BaN and c-BaN are essentially a general framework which renders the adversary the flexibility for further customizing backdoor attacks. We extensively evaluate our techniques on three benchmark datasets: MNIST, CelebA, and CIFAR-10. Our techniques achieve almost perfect attack performance on backdoored data with a negligible utility loss. We further show that our techniques can bypass current state-of-the-art defense mechanisms against backdoor attacks, including ABS, Februus, MNTD, Neural Cleanse, and STRIP.
    R-GCN: The R Could Stand for Random. (arXiv:2203.02424v1 [cs.LG])
    The inception of Relational Graph Convolutional Networks (R-GCNs) marked a milestone in the Semantic Web domain as it allows for end-to-end training of machine learning models that operate on Knowledge Graphs (KGs). R-GCNs generate a representation for a node of interest by repeatedly aggregating parametrised, relation-specific transformations of its neighbours. However, in this paper, we argue that the the R-GCN's main contribution lies in this "message passing" paradigm, rather than the learned parameters. To this end, we introduce the "Random Relational Graph Convolutional Network" (RR-GCN), which constructs embeddings for nodes in the KG by aggregating randomly transformed random information from neigbours, i.e., with no learned parameters. We empirically show that RR-GCNs can compete with fully trained R-GCNs in both node classification and link prediction settings. The implications of these results are two-fold: on the one hand, our technique can be used as a quick baseline that novel KG embedding methods should be able to beat. On the other hand, it demonstrates that further research might reveal more parameter-efficient inductive biases for KGs.  ( 2 min )
    Solving Inverse Problems by Joint Posterior Maximization with Autoencoding Prior. (arXiv:2103.01648v3 [stat.ML] UPDATED)
    In this work we address the problem of solving ill-posed inverse problems in imaging where the prior is a variational autoencoder (VAE). Specifically we consider the decoupled case where the prior is trained once and can be reused for many different log-concave degradation models without retraining. Whereas previous MAP-based approaches to this problem lead to highly non-convex optimization algorithms, our approach computes the joint (space-latent) MAP that naturally leads to alternate optimization algorithms and to the use of a stochastic encoder to accelerate computations. The resulting technique (JPMAP) performs Joint Posterior Maximization using an Autoencoding Prior. We show theoretical and experimental evidence that the proposed objective function is quite close to bi-convex. Indeed it satisfies a weak bi-convexity property which is sufficient to guarantee that our optimization scheme converges to a stationary point. We also highlight the importance of correctly training the VAE using a denoising criterion, in order to ensure that the encoder generalizes well to out-of-distribution images, without affecting the quality of the generative model. This simple modification is key to providing robustness to the whole procedure. Finally we show how our joint MAP methodology relates to more common MAP approaches, and we propose a continuation scheme that makes use of our JPMAP algorithm to provide more robust MAP estimates. Experimental results also show the higher quality of the solutions obtained by our JPMAP approach with respect to other non-convex MAP approaches which more often get stuck in spurious local optima.
    Physics-constrained deep neural network method for estimating parameters in a redox flow battery. (arXiv:2106.11451v2 [physics.chem-ph] UPDATED)
    In this paper, we present a physics-constrained deep neural network (PCDNN) method for parameter estimation in the zero-dimensional (0D) model of the vanadium redox flow battery (VRFB). In this approach, we use deep neural networks (DNNs) to approximate the model parameters as functions of the operating conditions. This method allows the integration of the VRFB computational models as the physical constraints in the parameter learning process, leading to enhanced accuracy of parameter estimation and cell voltage prediction. Using an experimental dataset, we demonstrate that the PCDNN method can estimate model parameters for a range of operating conditions and improve the 0D model prediction of voltage compared to the 0D model prediction with constant operation-condition-independent parameters estimated with traditional inverse methods. We also demonstrate that the PCDNN approach has an improved generalization ability for estimating parameter values for operating conditions not used in the DNN training.
    Interpolating between sampling and variational inference with infinite stochastic mixtures. (arXiv:2110.09618v2 [stat.ML] UPDATED)
    Sampling and Variational Inference (VI) are two large families of methods for approximate inference that have complementary strengths. Sampling methods excel at approximating arbitrary probability distributions, but can be inefficient. VI methods are efficient, but may misrepresent the true distribution. Here, we develop a general framework where approximations are stochastic mixtures of simple component distributions. Both sampling and VI can be seen as special cases: in sampling, each mixture component is a delta-function and is chosen stochastically, while in standard VI a single component is chosen to minimize divergence. We derive a practical method that interpolates between sampling and VI by solving an optimization problem over a mixing distribution. Intermediate inference methods then arise by varying a single parameter. Our method provably improves on sampling (reducing variance) and on VI (reducing bias+variance despite increasing variance). We demonstrate our method's bias/variance trade-off in practice on reference problems, and we compare outcomes to commonly used sampling and VI methods. This work takes a step towards a highly flexible yet simple family of inference methods that combines the complementary strengths of sampling and VI.
    Multimodal Data Visualization and Denoising with Integrated Diffusion. (arXiv:2102.06757v3 [cs.LG] UPDATED)
    We propose a method called integrated diffusion for combining multimodal datasets, or data gathered via several different measurements on the same system, to create a joint data diffusion operator. As real world data suffers from both local and global noise, we introduce mechanisms to optimally calculate a diffusion operator that reflects the combined information from both modalities. We show the utility of this joint operator in data denoising, visualization and clustering, performing better than other methods to integrate and analyze multimodal data. We apply our method to multi-omic data generated from blood cells, measuring both gene expression and chromatin accessibility. Our approach better visualizes the geometry of the joint data, captures known cross-modality associations and identifies known cellular populations. More generally, integrated diffusion is broadly applicable to multimodal datasets generated in many medical and biological systems.
    When should agents explore?. (arXiv:2108.11811v2 [cs.LG] UPDATED)
    Exploration remains a central challenge for reinforcement learning (RL). Virtually all existing methods share the feature of a monolithic behaviour policy that changes only gradually (at best). In contrast, the exploratory behaviours of animals and humans exhibit a rich diversity, namely including forms of switching between modes. This paper presents an initial study of mode-switching, non-monolithic exploration for RL. We investigate different modes to switch between, at what timescales it makes sense to switch, and what signals make for good switching triggers. We also propose practical algorithmic components that make the switching mechanism adaptive and robust, which enables flexibility without an accompanying hyper-parameter-tuning burden. Finally, we report a promising and detailed analysis on Atari, using two-mode exploration and switching at sub-episodic time-scales.
    Towards Synthesizing Twelve-Lead Electrocardiograms from Two Asynchronous Leads. (arXiv:2103.00006v3 [eess.SP] UPDATED)
    The electrocardiogram (ECG) records electrical signals in a non-invasive way to observe the condition of the heart, typically looking at the heart from 12 different directions. Several types of the cardiac disease are diagnosed by using 12-lead ECGs Recently, various wearable devices have enabled immediate access to the ECG without the use of wieldy equipment. However, they only provide ECGs with a couple of leads. This results in an inaccurate diagnosis of cardiac disease due to lacking of required leads. We propose a deep generative model for ECG synthesis from two asynchronous leads to ten leads. It first represents a heart condition referring to two leads, and then generates ten leads based on the represented heart condition. Both the rhythm and amplitude of leads generated resemble those of the original ones, while the technique removes noise and the baseline wander appearing in the original leads. As a data augmentation method, our model improves the classification performance of models compared with models using ECGs with only one or two leads.
    Interpretable Off-Policy Learning via Hyperbox Search. (arXiv:2203.02473v1 [stat.ML])
    Personalized treatment decisions have become an integral part of modern medicine. Thereby, the aim is to make treatment decisions based on individual patient characteristics. Numerous methods have been developed for learning such policies from observational data that achieve the best outcome across a certain policy class. Yet these methods are rarely interpretable. However, interpretability is often a prerequisite for policy learning in clinical practice. In this paper, we propose an algorithm for interpretable off-policy learning via hyperbox search. In particular, our policies can be represented in disjunctive normal form (i.e., OR-of-ANDs) and are thus intelligible. We prove a universal approximation theorem that shows that our policy class is flexible enough to approximate any measurable function arbitrarily well. For optimization, we develop a tailored column generation procedure within a branch-and-bound framework. Using a simulation study, we demonstrate that our algorithm outperforms state-of-the-art methods from interpretable off-policy learning in terms of regret. Using real-word clinical data, we perform a user study with actual clinical experts, who rate our policies as highly interpretable.
    ECG-ATK-GAN: Robustness against Adversarial Attacks on ECGs using Conditional Generative Adversarial Networks. (arXiv:2110.09983v2 [eess.SP] UPDATED)
    Automating arrhythmia detection from ECG requires a robust and trusted system that retains high accuracy under electrical disturbances. Many machine learning approaches have reached human-level performance in classifying arrhythmia from ECGs. However, these architectures are vulnerable to adversarial attacks, which can misclassify ECG signals by decreasing the model's accuracy. Adversarial attacks are small crafted perturbations injected in the original data which manifest the out-of-distribution shifts in signal to misclassify the correct class. Thus, security concerns arise for false hospitalization and insurance fraud abusing these perturbations. To mitigate this problem, we introduce a novel Conditional Generative Adversarial Network (GAN), robust against adversarial attacked ECG signals and retaining high accuracy. Our architecture integrates a new class-weighted objective function for adversarial perturbation identification and two novel blocks for discerning and combining out-of-distribution shifts in signals in the learning process for accurately classifying various arrhythmia types. Furthermore, we benchmark our architecture on six different white and black-box attacks and compare with other recently proposed arrhythmia classification models on two publicly available ECG arrhythmia datasets. The experiment confirms that our model is more robust against such adversarial attacks for classifying arrhythmia with high accuracy.
    Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity. (arXiv:2111.14330v2 [cs.CV] UPDATED)
    DETR is the first end-to-end object detector using a transformer encoder-decoder architecture and demonstrates competitive performance but low computational efficiency on high resolution feature maps. The subsequent work, Deformable DETR, enhances the efficiency of DETR by replacing dense attention with deformable attention, which achieves 10x faster convergence and improved performance. Deformable DETR uses the multiscale feature to ameliorate performance, however, the number of encoder tokens increases by 20x compared to DETR, and the computation cost of the encoder attention remains a bottleneck. In our preliminary experiment, we observe that the detection performance hardly deteriorates even if only a part of the encoder token is updated. Inspired by this observation, we propose Sparse DETR that selectively updates only the tokens expected to be referenced by the decoder, thus help the model effectively detect objects. In addition, we show that applying an auxiliary detection loss on the selected tokens in the encoder improves the performance while minimizing computational overhead. We validate that Sparse DETR achieves better performance than Deformable DETR even with only 10% encoder tokens on the COCO dataset. Albeit only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR. Code is available at https://github.com/kakaobrain/sparse-detr
    The Curse of Zero Task Diversity: On the Failure of Transfer Learning to Outperform MAML and their Empirical Equivalence. (arXiv:2112.13121v3 [cs.LG] UPDATED)
    Recently, it has been observed that a transfer learning solution might be all we need to solve many few-shot learning benchmarks -- thus raising important questions about when and how meta-learning algorithms should be deployed. In this paper, we seek to clarify these questions by proposing a novel metric -- the diversity coefficient -- to measure the diversity of tasks in a few-shot learning benchmark. We hypothesize that the diversity coefficient of the few-shot learning benchmark is predictive of whether meta-learning solutions will succeed or not. Using the diversity coefficient, we show that the MiniImagenet benchmark has zero diversity. This novel insight contextualizes claims that transfer learning solutions are better than meta-learned solutions. Specifically, we empirically find that a diversity coefficient of zero correlates with a high similarity between transfer learning and Model-Agnostic Meta-Learning (MAML) learned solutions in terms of meta-accuracy (at meta-test time). Therefore, we conjecture meta-learned solutions have the same meta-test performance as transfer learning when the diversity coefficient is zero. Our work provides the first test of whether diversity correlates with meta-learning success.
    Smart Households Demand Response Management with Micro Grid. (arXiv:1907.03641v1 [eess.SY] CROSS LISTED)
    Nowadays the emerging smart grid technology opens up the possibility of two-way communication between customers and energy utilities. Demand Response Management (DRM) offers the promise of saving money for commercial customers and households while helps utilities operate more efficiently. In this paper, an Incentive-based Demand Response Optimization (IDRO) model is proposed to efficiently schedule household appliances for minimum usage during peak hours. The proposed method is a multi-objective optimization technique based on Nonlinear Auto-Regressive Neural Network (NAR-NN) which considers energy provided by the utility and rooftop installed photovoltaic (PV) system. The proposed method is tested and verified using 300 case studies (household). Data analysis for a period of one year shows a noticeable improvement in power factor and customers bill.
    ITSA: An Information-Theoretic Approach to Automatic Shortcut Avoidance and Domain Generalization in Stereo Matching Networks. (arXiv:2201.02263v2 [cs.CV] UPDATED)
    State-of-the-art stereo matching networks trained only on synthetic data often fail to generalize to more challenging real data domains. In this paper, we attempt to unfold an important factor that hinders the networks from generalizing across domains: through the lens of shortcut learning. We demonstrate that the learning of feature representations in stereo matching networks is heavily influenced by synthetic data artefacts (shortcut attributes). To mitigate this issue, we propose an Information-Theoretic Shortcut Avoidance~(ITSA) approach to automatically restrict shortcut-related information from being encoded into the feature representations. As a result, our proposed method learns robust and shortcut-invariant features by minimizing the sensitivity of latent features to input variations. To avoid the prohibitive computational cost of direct input sensitivity optimization, we propose an effective yet feasible algorithm to achieve robustness. We show that using this method, state-of-the-art stereo matching networks that are trained purely on synthetic data can effectively generalize to challenging and previously unseen real data scenarios. Importantly, the proposed method enhances the robustness of the synthetic trained networks to the point that they outperform their fine-tuned counterparts (on real data) for challenging out-of-domain stereo datasets.
    Invariance encoding in sliced-Wasserstein space for image classification with limited training data. (arXiv:2201.02980v2 [cs.CV] UPDATED)
    Deep convolutional neural networks (CNNs) are broadly considered to be state-of-the-art generic end-to-end image classification systems. However, they are known to underperform when training data are limited and thus require data augmentation strategies that render the method computationally expensive and not always effective. Rather than using a data augmentation strategy to encode invariances as typically done in machine learning, here we propose to mathematically augment a nearest subspace classification model in sliced-Wasserstein space by exploiting certain mathematical properties of the Radon Cumulative Distribution Transform (R-CDT), a recently introduced image transform. We demonstrate that for a particular type of learning problem, our mathematical solution has advantages over data augmentation with deep CNNs in terms of classification accuracy and computational complexity, and is particularly effective under a limited training data setting. The method is simple, effective, computationally efficient, non-iterative, and requires no parameters to be tuned. Python code implementing our method is available at https://github.com/rohdelab/mathematical_augmentation. Our method is integrated as a part of the software package PyTransKit, which is available at https://github.com/rohdelab/PyTransKit.
    The Machine Learning for Combinatorial Optimization Competition (ML4CO): Results and Insights. (arXiv:2203.02433v1 [cs.LG])
    Combinatorial optimization is a well-established area in operations research and computer science. Until recently, its methods have focused on solving problem instances in isolation, ignoring that they often stem from related data distributions in practice. However, recent years have seen a surge of interest in using machine learning as a new approach for solving combinatorial problems, either directly as solvers or by enhancing exact solvers. Based on this context, the ML4CO aims at improving state-of-the-art combinatorial optimization solvers by replacing key heuristic components. The competition featured three challenging tasks: finding the best feasible solution, producing the tightest optimality certificate, and giving an appropriate solver configuration. Three realistic datasets were considered: balanced item placement, workload apportionment, and maritime inventory routing. This last dataset was kept anonymous for the contestants.
    Pedestrian Stop and Go Forecasting with Hybrid Feature Fusion. (arXiv:2203.02489v1 [cs.CV])
    Forecasting pedestrians' future motions is essential for autonomous driving systems to safely navigate in urban areas. However, existing prediction algorithms often overly rely on past observed trajectories and tend to fail around abrupt dynamic changes, such as when pedestrians suddenly start or stop walking. We suggest that predicting these highly non-linear transitions should form a core component to improve the robustness of motion prediction algorithms. In this paper, we introduce the new task of pedestrian stop and go forecasting. Considering the lack of suitable existing datasets for it, we release TRANS, a benchmark for explicitly studying the stop and go behaviors of pedestrians in urban traffic. We build it from several existing datasets annotated with pedestrians' walking motions, in order to have various scenarios and behaviors. We also propose a novel hybrid model that leverages pedestrian-specific and scene features from several modalities, both video sequences and high-level attributes, and gradually fuses them to integrate multiple levels of context. We evaluate our model and several baselines on TRANS, and set a new benchmark for the community to work on pedestrian stop and go forecasting.  ( 2 min )
    MAML is a Noisy Contrastive Learner in Classification. (arXiv:2106.15367v3 [cs.LG] UPDATED)
    Model-agnostic meta-learning (MAML) is one of the most popular and widely adopted meta-learning algorithms, achieving remarkable success in various learning problems. Yet, with the unique design of nested inner-loop and outer-loop updates, which govern the task-specific and meta-model-centric learning, respectively, the underlying learning objective of MAML remains implicit and thus impedes a more straightforward understanding of it. In this paper, we provide a new perspective of the working mechanism of MAML. We discover that MAML is analogous to a meta-learner using a supervised contrastive objective. The query features are pulled towards the support features of the same class and against those of different classes. Such contrastiveness is experimentally verified via an analysis based on the cosine similarity. Moreover, we reveal that vanilla MAML has an undesirable interference term originating from the random initialization and the cross-task interaction. We thus propose a simple but effective technique, zeroing trick, to alleviate the interference. Extensive experiments are conducted on both mini-ImageNet and Omniglot datasets to demonstrate the consistent improvement brought by our proposed method, validating its effectiveness.  ( 2 min )
    The Familiarity Hypothesis: Explaining the Behavior of Deep Open Set Methods. (arXiv:2203.02486v1 [cs.CV])
    In many object recognition applications, the set of possible categories is an open set, and the deployed recognition system will encounter novel objects belonging to categories unseen during training. Detecting such "novel category" objects is usually formulated as an anomaly detection problem. Anomaly detection algorithms for feature-vector data identify anomalies as outliers, but outlier detection has not worked well in deep learning. Instead, methods based on the computed logits of visual object classifiers give state-of-the-art performance. This paper proposes the Familiarity Hypothesis that these methods succeed because they are detecting the absence of familiar learned features rather than the presence of novelty. The paper reviews evidence from the literature and presents additional evidence from our own experiments that provide strong support for this hypothesis. The paper concludes with a discussion of whether familiarity detection is an inevitable consequence of representation learning.  ( 2 min )
    Behavioural Curves Analysis Using Near-Infrared-Iris Image Sequences. (arXiv:2203.02488v1 [cs.CV])
    This paper proposes a new method to estimate behavioural curves from a stream of Near-Infra-Red (NIR) iris video frames. This method can be used in a Fitness For Duty system (FFD). The research focuses on determining the effect of external factors such as alcohol, drugs, and sleepiness on the Central Nervous System (CNS). The aim is to analyse how this behaviour is represented on iris and pupil movements and if it is possible to capture these changes with a standard NIR camera. The behaviour analysis showed essential differences in pupil and iris behaviour to classify the workers in "Fit" or "Unfit" conditions. The best results can distinguish subjects robustly under alcohol, drug consumption, and sleep conditions. The Multi-Layer-Perceptron and Gradient Boosted Machine reached the best results in all groups with an overall accuracy for Fit and Unfit classes of 74.0% and 75.5%, respectively. These results open a new application for iris capture devices.  ( 2 min )
    CIM-PPO:Proximal Policy Optimization with Liu-Correntropy Induced Metric. (arXiv:2110.10522v2 [cs.LG] UPDATED)
    As an algorithm based on deep reinforcement learning, Proximal Policy Optimization (PPO) performs well in many complex tasks and has become one of the most popular RL algorithms in recent years. According to the mechanism of penalty in surrogate objective, PPO can be divided into PPO with KL Divergence (KL-PPO) and PPO with Clip function(Clip-PPO). Clip-PPO is widely used in a variety of practical scenarios and has attracted the attention of many researchers. Therefore, many variations have also been created, making the algorithm better and better. However, as a more theoretical algorithm, KL-PPO was neglected because its performance was not as good as CliP-PPO. In this article, we analyze the asymmetry effect of KL divergence on PPO's objective function , and give the inequality that can indicate when the asymmetry will affect the efficiency of KL-PPO. Proposed PPO with Correntropy Induced Metric algorithm(CIM-PPO) that use the theory of correntropy(a symmetry metric method that was widely used in M-estimation to evaluate two distributions' difference)and applied it in PPO. Then, we designed experiments based on OpenAIgym to test the effectiveness of the new algorithm and compare it with KL-PPO and CliP-PPO.
    Does anatomical contextual information improve 3D U-Net based brain tumor segmentation?. (arXiv:2010.13460v3 [eess.IV] UPDATED)
    Effective, robust, and automatic tools for brain tumor segmentation are needed for the extraction of information useful in treatment planning from magnetic resonance (MR) images. Context-aware artificial intelligence is an emerging concept for the development of deep learning applications for computer-aided medical image analysis. In this work, it is investigated whether the addition of contextual information from the brain anatomy in the form of white matter, gray matter, and cerebrospinal fluid masks and probability maps improves U-Net-based brain tumor segmentation. The BraTS2020 dataset was used to train and test two standard 3D U-Net models that, in addition to the conventional MR image modalities, used the anatomical contextual information as extra channels in the form of binary masks (CIM) or probability maps (CIP). A baseline model (BLM) that only used the conventional MR image modalities was also trained. The impact of adding contextual information was investigated in terms of overall segmentation accuracy, model training time, domain generalization, and compensation for fewer MR modalities available for each subject. Results show that there is no statistically significant difference when comparing Dice scores between the baseline model and the contextual information models, even when comparing performances for high- and low-grade tumors independently. Only in the case of compensation for fewer MR modalities available for each subject did the addition of anatomical contextual information significantly improve the segmentation of the whole tumor. Overall, there is no overall significant improvement in segmentation performance when using anatomical contextual information in the form of either binary masks or probability maps as extra channels.
    Physics-Guided Deep Learning for Dynamical Systems: A Survey. (arXiv:2107.01272v5 [cs.LG] UPDATED)
    Modeling complex physical dynamics is a fundamental task in science and engineering. Traditional physics-based models are sample efficient, interpretable but often rely on rigid assumptions. Furthermore, direct numerical approximation is usually computationally intensive, requiring significant computational resources and expertise. While deep learning (DL) provides novel alternatives for efficiently recognizing complex patterns and emulating nonlinear dynamics, its predictions do not necessarily obey the governing laws of physical systems, nor do they generalize well across different systems. Thus, the study of physics-guided DL emerged and has gained great progress. Physics-guided DL aims to take the best from both physics-based modeling and state-of-the-art DL models to better solve scientific problems. In this paper, we provide a structured overview of existing methodologies of integrating prior physical knowledge or physics-based modeling into DL, with a special emphasis on learning dynamical systems. We also discuss the fundamental challenges and emerging opportunities in the area.
    Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features. (arXiv:2104.00629v2 [stat.ML] UPDATED)
    Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect in data analysis. A common problem are high cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance, and -- if possible -- derive best practices on when to use which technique. We conducted a large-scale benchmark experiment, where we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary- and multiclass- classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or to reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.
    3D Shape Variational Autoencoder Latent Disentanglement via Mini-Batch Feature Swapping for Bodies and Faces. (arXiv:2111.12448v3 [cs.CV] UPDATED)
    Learning a disentangled, interpretable, and structured latent representation in 3D generative models of faces and bodies is still an open problem. The problem is particularly acute when control over identity features is required. In this paper, we propose an intuitive yet effective self-supervised approach to train a 3D shape variational autoencoder (VAE) which encourages a disentangled latent representation of identity features. Curating the mini-batch generation by swapping arbitrary features across different shapes allows to define a loss function leveraging known differences and similarities in the latent representations. Experimental results conducted on 3D meshes show that state-of-the-art methods for latent disentanglement are not able to disentangle identity features of faces and bodies. Our proposed method properly decouples the generation of such features while maintaining good representation and reconstruction capabilities.
    Mixed Reality Depth Contour Occlusion Using Binocular Similarity Matching and Three-dimensional Contour Optimisation. (arXiv:2203.02300v1 [cs.CV])
    Mixed reality applications often require virtual objects that are partly occluded by real objects. However, previous research and commercial products have limitations in terms of performance and efficiency. To address these challenges, we propose a novel depth contour occlusion (DCO) algorithm. The proposed method is based on the sensitivity of contour occlusion and a binocular stereoscopic vision device. In this method, a depth contour map is combined with a sparse depth map obtained from a two-stage adaptive filter area stereo matching algorithm and the depth contour information of the objects extracted by a digital image stabilisation optical flow method. We also propose a quadratic optimisation model with three constraints to generate an accurate dense map of the depth contour for high-quality real-virtual occlusion. The whole process is accelerated by GPU. To evaluate the effectiveness of the algorithm, we demonstrate a time con-sumption statistical analysis for each stage of the DCO algorithm execution. To verify the relia-bility of the real-virtual occlusion effect, we conduct an experimental analysis on single-sided, enclosed, and complex occlusions; subsequently, we compare it with the occlusion method without quadratic optimisation. With our GPU implementation for real-time DCO, the evaluation indicates that applying the presented DCO algorithm can enhance the real-time performance and the visual quality of real-virtual occlusion.  ( 2 min )
    Quantum Approximate Optimization Algorithm for Bayesian network structure learning. (arXiv:2203.02400v1 [quant-ph])
    Bayesian network structure learning is an NP-hard problem that has been faced by a number of traditional approaches in recent decades. Currently, quantum technologies offer a wide range of advantages that can be exploited to solve optimization tasks that cannot be addressed in an efficient way when utilizing classic computing approaches. In this work, a specific type of variational quantum algorithm, the quantum approximate optimization algorithm, was used to solve the Bayesian network structure learning problem, by employing $3n(n-1)/2$ qubits, where $n$ is the number of nodes in the Bayesian network to be learned. Our results showed that the quantum approximate optimization algorithm approach offers competitive results with state-of-the-art methods and quantitative resilience to quantum noise. The approach was applied to a cancer benchmark problem, and the results justified the use of variational quantum algorithms for solving the Bayesian network structure learning problem.  ( 2 min )
    Elements of Sequential Monte Carlo. (arXiv:1903.04797v2 [stat.ML] UPDATED)
    A core problem in statistics and probabilistic machine learning is to compute probability distributions and expectations. This is the fundamental problem of Bayesian statistics and machine learning, which frames all inference as expectations with respect to the posterior distribution. The key challenge is to approximate these intractable expectations. In this tutorial, we review sequential Monte Carlo (SMC), a random-sampling-based class of methods for approximate inference. First, we explain the basics of SMC, discuss practical issues, and review theoretical results. We then examine two of the main user design choices: the proposal distributions and the so called intermediate target distributions. We review recent results on how variational inference and amortization can be used to learn efficient proposals and target distributions. Next, we discuss the SMC estimate of the normalizing constant, how this can be used for pseudo-marginal inference and inference evaluation. Throughout the tutorial we illustrate the use of SMC on various models commonly used in machine learning, such as stochastic recurrent neural networks, probabilistic graphical models, and probabilistic programs.
    Rethinking Efficient Lane Detection via Curve Modeling. (arXiv:2203.02431v1 [cs.CV])
    This paper presents a novel parametric curve-based method for lane detection in RGB images. Unlike state-of-the-art segmentation-based and point detection-based methods that typically require heuristics to either decode predictions or formulate a large sum of anchors, the curve-based methods can learn holistic lane representations naturally. To handle the optimization difficulties of existing polynomial curve methods, we propose to exploit the parametric B\'ezier curve due to its ease of computation, stability, and high freedom degrees of transformations. In addition, we propose the deformable convolution-based feature flip fusion, for exploiting the symmetry properties of lanes in driving scenes. The proposed method achieves a new state-of-the-art performance on the popular LLAMAS benchmark. It also achieves favorable accuracy on the TuSimple and CULane datasets, while retaining both low latency (> 150 FPS) and small model size (< 10M). Our method can serve as a new baseline, to shed the light on the parametric curves modeling for lane detection. Codes of our model and PytorchAutoDrive: a unified framework for self-driving perception, are available at: https://github.com/voldemortX/pytorch-auto-drive .  ( 2 min )
    Do Explanations Explain? Model Knows Best. (arXiv:2203.02269v1 [cs.LG])
    It is a mystery which input features contribute to a neural network's output. Various explanation (feature attribution) methods are proposed in the literature to shed light on the problem. One peculiar observation is that these explanations (attributions) point to different features as being important. The phenomenon raises the question, which explanation to trust? We propose a framework for evaluating the explanations using the neural network model itself. The framework leverages the network to generate input features that impose a particular behavior on the output. Using the generated features, we devise controlled experimental setups to evaluate whether an explanation method conforms to an axiom. Thus we propose an empirical framework for axiomatic evaluation of explanation methods. We evaluate well-known and promising explanation solutions using the proposed framework. The framework provides a toolset to reveal properties and drawbacks within existing and future explanation solutions.  ( 2 min )
    DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations. (arXiv:2203.02013v1 [cs.LG])
    The ability for a human to understand an Artificial Intelligence (AI) model's decision-making process is critical in enabling stakeholders to visualize model behavior, perform model debugging, promote trust in AI models, and assist in collaborative human-AI decision-making. As a result, the research fields of interpretable and explainable AI have gained traction within AI communities as well as interdisciplinary scientists seeking to apply AI in their subject areas. In this paper, we focus on advancing the state-of-the-art in interpreting multimodal models - a class of machine learning methods that tackle core challenges in representing and capturing interactions between heterogeneous data sources such as images, text, audio, and time-series data. Multimodal models have proliferated numerous real-world applications across healthcare, robotics, multimedia, affective computing, and human-computer interaction. By performing model disentanglement into unimodal contributions (UC) and multimodal interactions (MI), our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models while maintaining generality across arbitrary modalities, model architectures, and tasks. Through a comprehensive suite of experiments on both synthetic and real-world multimodal tasks, we show that DIME generates accurate disentangled explanations, helps users of multimodal models gain a deeper understanding of model behavior, and presents a step towards debugging and improving these models for real-world deployment. Code for our experiments can be found at https://github.com/lvyiwei1/DIME.  ( 2 min )
    Graph clustering with Boltzmann machines. (arXiv:2203.02471v1 [cs.LG])
    Graph clustering is the process of grouping vertices into densely connected sets called clusters. We tailor two mathematical programming formulations from the literature, to this problem. In doing so, we obtain a heuristic approximation to the intra-cluster density maximization problem. We use two variations of a Boltzmann machine heuristic to obtain numerical solutions. For benchmarking purposes, we compare solution quality and computational performances to those obtained using a commercial solver, Gurobi. We also compare clustering quality to the clusters obtained using the popular Louvain modularity maximization method. Our initial results clearly demonstrate the superiority of our problem formulations. They also establish the superiority of the Boltzmann machine over the traditional exact solver. In the case of smaller less complex graphs, Boltzmann machines provide the same solutions as Gurobi, but with solution times that are orders of magnitude lower. In the case of larger and more complex graphs, Gurobi fails to return meaningful results within a reasonable time frame. Finally, we also note that both our clustering formulations, the distance minimization and $K$-medoids, yield clusters of superior quality to those obtained with the Louvain algorithm.  ( 2 min )
    In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis. (arXiv:2203.02116v1 [cs.CL])
    One of the burning problems lately in Japan has been cyber-bullying, or slandering and bullying people online. The problem has been especially noticed on unofficial Web sites of Japanese schools. Volunteers consisting of school personnel and PTA (Parent-Teacher Association) members have started Online Patrol to spot malicious contents within Web forums and blogs. In practise, Online Patrol assumes reading through the whole Web contents, which is a task difficult to perform manually. With this paper we introduce a research intended to help PTA members perform Online Patrol more efficiently. We aim to develop a set of tools that can automatically detect malicious entries and report them to PTA members. First, we collected cyber-bullying data from unofficial school Web sites. Then we performed analysis of this data in two ways. Firstly, we analysed the entries with a multifaceted affect analysis system in order to find distinctive features for cyber-bullying and apply them to a machine learning classifier. Secondly, we applied a SVM based machine learning method to train a classifier for detection of cyber-bullying. The system was able to classify cyber-bullying entries with 88.2% of balanced F-score.  ( 2 min )
    Intrinsically-Motivated Reinforcement Learning: A Brief Introduction. (arXiv:2203.02298v1 [cs.LG])
    Reinforcement learning (RL) is one of the three basic paradigms of machine learning. It has demonstrated impressive performance in many complex tasks like Go and StarCraft, which is increasingly involved in smart manufacturing and autonomous driving. However, RL consistently suffers from the exploration-exploitation dilemma. In this paper, we investigated the problem of improving exploration in RL and introduced the intrinsically-motivated RL. In sharp contrast to the classic exploration strategies, intrinsically-motivated RL utilizes the intrinsic learning motivation to provide sustainable exploration incentives. We carefully classified the existing intrinsic reward methods and analyzed their practical drawbacks. Moreover, we proposed a new intrinsic reward method via R\'enyi state entropy maximization, which overcomes the drawbacks of the preceding methods and provides powerful exploration incentives. Finally, extensive simulation demonstrated that the proposed module achieve superior performance with higher efficiency and robustness.  ( 2 min )
    Contextformer: A Transformer with Spatio-Channel Attention for Context Modeling in Learned Image Compression. (arXiv:2203.02452v1 [eess.IV])
    Entropy modeling is a key component for high-performance image compression algorithms. Recent developments in autoregressive context modeling helped learning-based methods to surpass their classical counterparts. However, the performance of those models can be further improved due to the underexploited spatio-channel dependencies in latent space, and the suboptimal implementation of context adaptivity. Inspired by the adaptive characteristics of the transformers, we propose a transformer-based context model, a.k.a. Contextformer, which generalizes the de facto standard attention mechanism to spatio-channel attention. We replace the context model of a modern compression framework with the Contextformer and test it on the widely used Kodak image dataset. Our experimental results show that the proposed model provides up to 10% rate savings compared to the standard Versatile Video Coding (VVC) Test Model (VTM) 9.1, and outperforms various learning-based models.  ( 2 min )
    Contextual Attention Network: Transformer Meets U-Net. (arXiv:2203.01932v1 [eess.IV])
    Currently, convolutional neural networks (CNN) (e.g., U-Net) have become the de facto standard and attained immense success in medical image segmentation. However, as a downside, CNN based methods are a double-edged sword as they fail to build long-range dependencies and global context connections due to the limited receptive field that stems from the intrinsic characteristics of the convolution operation. Hence, recent articles have exploited Transformer variants for medical image segmentation tasks which open up great opportunities due to their innate capability of capturing long-range correlations through the attention mechanism. Although being feasibly designed, most of the cohort studies incur prohibitive performance in capturing local information, thereby resulting in less lucidness of boundary areas. In this paper, we propose a contextual attention network to tackle the aforementioned limitations. The proposed method uses the strength of the Transformer module to model the long-range contextual dependency. Simultaneously, it utilizes the CNN encoder to capture local semantic information. In addition, an object-level representation is included to model the regional interaction map. The extracted hierarchical features are then fed to the contextual attention module to adaptively recalibrate the representation space using the local information. Then, they emphasize the informative regions while taking into account the long-range contextual dependency derived by the Transformer module. We validate our method on several large-scale public medical image segmentation datasets and achieve state-of-the-art performance. We have provided the implementation code in https://github.com/rezazad68/TMUnet.  ( 2 min )
    Momentum Accelerates the Convergence of Stochastic AUPRC Maximization. (arXiv:2107.01173v2 [cs.LG] UPDATED)
    In this paper, we study stochastic optimization of areas under precision-recall curves (AUPRC), which is widely used for combating imbalanced classification tasks. Although a few methods have been proposed for maximizing AUPRC, stochastic optimization of AUPRC with convergence guarantee remains an undeveloped territory. A state-of-the-art complexity is $O(1/\epsilon^5)$ for finding an $\epsilon$-stationary solution. In this paper, we further improve the stochastic optimization of AURPC by (i) developing novel stochastic momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution; and (ii) designing a novel family of stochastic adaptive methods with the same iteration complexity, which enjoy faster convergence in practice. To this end, we propose two innovative techniques that are critical for improving the convergence: (i) the biased estimators for tracking individual ranking scores are updated in a randomized coordinate-wise manner; and (ii) a momentum update is used on top of the stochastic gradient estimator for tracking the gradient of the objective. The novel analysis of Adam-style updates is also one main contribution. Extensive experiments on various data sets demonstrate the effectiveness of the proposed algorithms. Of independent interest, the proposed stochastic momentum and adaptive algorithms are also applicable to a class of two-level stochastic dependent compositional optimization problems.  ( 2 min )
    Boosting the Performance of Quantum Annealers using Machine Learning. (arXiv:2203.02360v1 [quant-ph])
    Noisy intermediate-scale quantum (NISQ) devices are spearheading the second quantum revolution. Of these, quantum annealers are the only ones currently offering real world, commercial applications on as many as 5000 qubits. The size of problems that can be solved by quantum annealers is limited mainly by errors caused by environmental noise and intrinsic imperfections of the processor. We address the issue of intrinsic imperfections with a novel error correction approach, based on machine learning methods. Our approach adjusts the input Hamiltonian to maximize the probability of finding the solution. In our experiments, the proposed error correction method improved the performance of annealing by up to three orders of magnitude and enabled the solving of a previously intractable, maximally complex problem.  ( 2 min )
    Ontological Learning from Weak Labels. (arXiv:2203.02483v1 [cs.SD])
    Ontologies encompass a formal representation of knowledge through the definition of concepts or properties of a domain, and the relationships between those concepts. In this work, we seek to investigate whether using this ontological information will improve learning from weakly labeled data, which are easier to collect since it requires only the presence or absence of an event to be known. We use the AudioSet ontology and dataset, which contains audio clips weakly labeled with the ontology concepts and the ontology providing the "Is A" relations between the concepts. We first re-implemented the model proposed by soundevent_ontology with modification to fit the multi-label scenario and then expand on that idea by using a Graph Convolutional Network (GCN) to model the ontology information to learn the concepts. We find that the baseline Siamese does not perform better by incorporating ontology information in the weak and multi-label scenario, but that the GCN does capture the ontology knowledge better for weak, multi-labeled data. In our experiments, we also investigate how different modules can tolerate noises introduced from weak labels and better incorporate ontology information. Our best Siamese-GCN model achieves mAP=0.45 and AUC=0.87 for lower-level concepts and mAP=0.72 and AUC=0.86 for higher-level concepts, which is an improvement over the baseline Siamese but about the same as our models that do not use ontology information.  ( 2 min )
    A flow-based IDS using Machine Learning in eBPF. (arXiv:2102.09980v3 [cs.CR] UPDATED)
    eBPF is a new technology which allows dynamically loading pieces of code into the Linux kernel. It can greatly speed up networking since it enables the kernel to process certain packets without the involvement of a userspace program. So far eBPF has been used for simple packet filtering applications such as firewalls or Denial of Service protection. We show that it is possible to develop a flow based network intrusion detection system based on machine learning entirely in eBPF. Our solution uses a decision tree and decides for each packet whether it is malicious or not, considering the entire previous context of the network flow. We achieve a performance increase of over 20% compared to the same solution implemented as a userspace program.  ( 2 min )
    Passive and Active Learning of Driver Behavior from Electric Vehicles. (arXiv:2203.02179v1 [cs.LG])
    Modeling driver behavior provides several advantages in the automotive industry, including prediction of electric vehicle energy consumption. Studies have shown that aggressive driving can consume up to 30% more energy than moderate driving, in certain driving scenarios. Machine learning methods are widely used for driver behavior classification, which, however, may yield some challenges such as sequence modeling on long time windows and lack of labeled data due to expensive annotation. To address the first challenge, passive learning of driver behavior, we investigate non-recurrent architectures such as self-attention models and convolutional neural networks with joint recurrence plots (JRP), and compare them with recurrent models. We find that self-attention models yield good performance, while JRP does not exhibit any significant improvement. However, with the window lengths of 5 and 10 seconds used in our study, none of the non-recurrent models outperform the recurrent models. To address the second challenge, we investigate several active learning methods with different informativeness measures. We evaluate uncertainty sampling, as well as more advanced methods, such as query by committee and active deep dropout. Our experiments demonstrate that some active sampling techniques can outperform random sampling, and therefore decrease the effort needed for annotation.  ( 2 min )
    ViNMT: Neural Machine Translation Toolkit. (arXiv:2112.15272v3 [cs.CL] UPDATED)
    We present an open-source toolkit for neural machine translation (NMT). The new toolkit is mainly based on vaulted Transformer (Vaswani et al., 2017) along with many other improvements detailed below, in order to create a self-contained, simple to use, consistent and comprehensive framework for Machine Translation tasks of various domains. It is tooled to support both bilingual and multilingual translation tasks, starting from building the model from respective corpora, to inferring new predictions or packaging the model to serving-capable JIT format.  ( 2 min )
    A multi-stream convolutional neural network for classification of progressive MCI in Alzheimer's disease using structural MRI images. (arXiv:2203.01944v1 [eess.IV])
    Early diagnosis of Alzheimer's disease and its prodromal stage, also known as mild cognitive impairment (MCI), is critical since some patients with progressive MCI will develop the disease. We propose a multi-stream deep convolutional neural network fed with patch-based imaging data to classify stable MCI and progressive MCI. First, we compare MRI images of Alzheimer's disease with cognitively normal subjects to identify distinct anatomical landmarks using a multivariate statistical test. These landmarks are then used to extract patches that are fed into the proposed multi-stream convolutional neural network to classify MRI images. Next, we train the architecture in a separate scenario using samples from Alzheimer's disease images, which are anatomically similar to the progressive MCI ones and cognitively normal images to compensate for the lack of progressive MCI training data. Finally, we transfer the trained model weights to the proposed architecture in order to fine-tune the model using progressive MCI and stable MCI data. Experimental results on the ADNI-1 dataset indicate that our method outperforms existing methods for MCI classification, with an F1-score of 85.96%.  ( 2 min )
    Dynamic Backdoors with Global Average Pooling. (arXiv:2203.02079v1 [cs.CR])
    Outsourced training and machine learning as a service have resulted in novel attack vectors like backdoor attacks. Such attacks embed a secret functionality in a neural network activated when the trigger is added to its input. In most works in the literature, the trigger is static, both in terms of location and pattern. The effectiveness of various detection mechanisms depends on this property. It was recently shown that countermeasures in image classification, like Neural Cleanse and ABS, could be bypassed with dynamic triggers that are effective regardless of their pattern and location. Still, such backdoors are demanding as they require a large percentage of poisoned training data. In this work, we are the first to show that dynamic backdoor attacks could happen due to a global average pooling layer without increasing the percentage of the poisoned training data. Nevertheless, our experiments in sound classification, text sentiment analysis, and image classification show this to be very difficult in practice.  ( 2 min )
    Interpretable Latent Variables in Deep State Space Models. (arXiv:2203.02057v1 [stat.ML])
    We introduce a new version of deep state-space models (DSSMs) that combines a recurrent neural network with a state-space framework to forecast time series data. The model estimates the observed series as functions of latent variables that evolve non-linearly through time. Due to the complexity and non-linearity inherent in DSSMs, previous works on DSSMs typically produced latent variables that are very difficult to interpret. Our paper focus on producing interpretable latent parameters with two key modifications. First, we simplify the predictive decoder by restricting the response variables to be a linear transformation of the latent variables plus some noise. Second, we utilize shrinkage priors on the latent variables to reduce redundancy and improve robustness. These changes make the latent variables much easier to understand and allow us to interpret the resulting latent variables as random effects in a linear mixed model. We show through two public benchmark datasets the resulting model improves forecasting performances.  ( 2 min )
    Learning Neural Networks on SVD Boosted Latent Spaces for Semantic Classification. (arXiv:2101.00563v1 [cs.LG] CROSS LISTED)
    The availability of large amounts of data and compelling computation power have made deep learning models much popular for text classification and sentiment analysis. Deep neural networks have achieved competitive performance on the above tasks when trained on naive text representations such as word count, term frequency, and binary matrix embeddings. However, many of the above representations result in the input space having a dimension of the order of the vocabulary size, which is enormous. This leads to a blow-up in the number of parameters to be learned, and the computational cost becomes infeasible when scaling to domains that require retaining a colossal vocabulary. This work proposes using singular value decomposition to transform the high dimensional input space to a lower-dimensional latent space. We show that neural networks trained on this lower-dimensional space are not only able to retain performance while savoring significant reduction in the computational complexity but, in many situations, also outperforms the classical neural networks trained on the native input space.  ( 2 min )
    Compressed Predictive Information Coding. (arXiv:2203.02051v1 [cs.LG])
    Unsupervised learning plays an important role in many fields, such as artificial intelligence, machine learning, and neuroscience. Compared to static data, methods for extracting low-dimensional structure for dynamic data are lagging. We developed a novel information-theoretic framework, Compressed Predictive Information Coding (CPIC), to extract useful representations from dynamic data. CPIC selectively projects the past (input) into a linear subspace that is predictive about the compressed data projected from the future (output). The key insight of our framework is to learn representations by minimizing the compression complexity and maximizing the predictive information in latent space. We derive variational bounds of the CPIC loss which induces the latent space to capture information that is maximally predictive. Our variational bounds are tractable by leveraging bounds of mutual information. We find that introducing stochasticity in the encoder robustly contributes to better representation. Furthermore, variational approaches perform better in mutual information estimation compared with estimates under a Gaussian assumption. We demonstrate that CPIC is able to recover the latent space of noisy dynamical systems with low signal-to-noise ratios, and extracts features predictive of exogenous variables in neuroscience data.  ( 2 min )
    Pattern Based Multivariable Regression using Deep Learning (PBMR-DP). (arXiv:2202.13541v2 [cs.CV] UPDATED)
    We propose a deep learning methodology for multivariate regression that is based on pattern recognition that triggers fast learning over sensor data. We used a conversion of sensors-to-image which enables us to take advantage of Computer Vision architectures and training processes. In addition to this data preparation methodology, we explore the use of state-of-the-art architectures to generate regression outputs to predict agricultural crop continuous yield information. Finally, we compare with some of the top models reported in MLCAS2021. We found that using a straightforward training process, we were able to accomplish an MAE of 4.394, RMSE of 5.945, and R^2 of 0.861.  ( 2 min )
    Speech watermarking: a solution for authentication of forensic audio digital recordings. (arXiv:2203.02275v1 [cs.CR])
    In this paper we discuss the problem of authentication of forensic audio when using digital recordings. Although forensic audio has been addressed in several papers the existing approaches are focused on analog magnetic recordings, which are becoming old-fashion due to the large amount of digital recorders available on the market (optical, solid-state, hard disks, etc). We present an approach based on digital signal processing that consist of spread spectrum techniques for speech watermarking. This approach presents the advantage that the authentication is based on the signal itself rather than the recording support. Thus, it is valid for whatever recording device. In addition, our proposal permits the introduction of relevant information such as recording date and time and all the relevant data (this is not possible with classical systems). Our experimental results reveal that the speech watermarking procedure does not interfere in a significant way with the posterior forensic speaker identification.  ( 2 min )
    On the Existence of the Adversarial Bayes Classifier (Extended Version). (arXiv:2112.01694v2 [cs.LG] UPDATED)
    Adversarial robustness is a critical property in a variety of modern machine learning applications. While it has been the subject of several recent theoretical studies, many important questions related to adversarial robustness are still open. In this work, we study a fundamental question regarding Bayes optimality for adversarial robustness. We provide general sufficient conditions under which the existence of a Bayes optimal classifier can be guaranteed for adversarial robustness. Our results can provide a useful tool for a subsequent study of surrogate losses in adversarial robustness and their consistency properties. This manuscript is the extended version of the paper "On the Existence of the Adversarial Bayes Classifier" published in NeurIPS. The results of the original paper did not apply to some non-strictly convex norms. Here we extend our results to all possible norms. Additionally, we clarify a missing step in one of our proofs.  ( 2 min )
    An Improved Frequent Directions Algorithm for Low-Rank Approximation via Block Krylov Iteration. (arXiv:2109.11703v2 [cs.LG] UPDATED)
    Frequent Directions, as a deterministic matrix sketching technique, has been proposed for tackling low-rank approximation problems. This method has a high degree of accuracy and practicality, but experiences a lot of computational cost for large-scale data. Several recent works on the randomized version of Frequent Directions greatly improve the computational efficiency, but unfortunately sacrifice some precision. To remedy such issue, this paper aims to find a more accurate projection subspace to further improve the efficiency and effectiveness of the existing Frequent Directions techniques. Specifically, by utilizing the power of Block Krylov Iteration and random projection technique, this paper presents a fast and accurate Frequent Directions algorithm named as r-BKIFD. The rigorous theoretical analysis shows that the proposed r-BKIFD has a comparable error bound with original Frequent Directions, and the approximation error can be arbitrarily small when the number of iterations is chosen appropriately. Extensive experimental results on both synthetic and real data further demonstrate the superiority of r-BKIFD over several popular Frequent Directions algorithms, both in terms of computational efficiency and accuracy.  ( 2 min )
    Stress Testing of Meta-learning Approaches for Few-shot Learning. (arXiv:2101.08587v1 [cs.LG] CROSS LISTED)
    Meta-learning (ML) has emerged as a promising learning method under resource constraints such as few-shot learning. ML approaches typically propose a methodology to learn generalizable models. In this work-in-progress paper, we put the recent ML approaches to a stress test to discover their limitations. Precisely, we measure the performance of ML approaches for few-shot learning against increasing task complexity. Our results show a quick degradation in the performance of initialization strategies for ML (MAML, TAML, and MetaSGD), while surprisingly, approaches that use an optimization strategy (MetaLSTM) perform significantly better. We further demonstrate the effectiveness of an optimization strategy for ML (MetaLSTM++) trained in a MAML manner over a pure optimization strategy. Our experiments also show that the optimization strategies for ML achieve higher transferability from simple to complex tasks.  ( 2 min )
    Salvaging Federated Learning by Local Adaptation. (arXiv:2002.04758v3 [cs.LG] UPDATED)
    Federated learning (FL) is a heavily promoted approach for training ML models on sensitive data, e.g., text typed by users on their smartphones. FL is expressly designed for training on data that are unbalanced and non-iid across the participants. To ensure privacy and integrity of the fedeated model, latest FL approaches use differential privacy or robust aggregation. We look at FL from the \emph{local} viewpoint of an individual participant and ask: (1) do participants have an incentive to participate in FL? (2) how can participants \emph{individually} improve the quality of their local models, without re-designing the FL framework and/or involving other participants? First, we show that on standard tasks such as next-word prediction, many participants gain no benefit from FL because the federated model is less accurate on their data than the models they can train locally on their own. Second, we show that differential privacy and robust aggregation make this problem worse by further destroying the accuracy of the federated model for many participants. Then, we evaluate three techniques for local adaptation of federated models: fine-tuning, multi-task learning, and knowledge distillation. We analyze where each is applicable and demonstrate that all participants benefit from local adaptation. Participants whose local models are poor obtain big accuracy improvements over conventional FL. Participants whose local models are better than the federated model\textemdash and who have no incentive to participate in FL today\textemdash improve less, but sufficiently to make the adapted federated model better than their local models.
    Is Importance Weighting Incompatible with Interpolating Classifiers?. (arXiv:2112.12986v2 [cs.LG] UPDATED)
    Importance weighting is a classic technique to handle distribution shifts. However, prior work has presented strong empirical and theoretical evidence demonstrating that importance weights can have little to no effect on overparameterized neural networks. Is importance weighting truly incompatible with the training of overparameterized neural networks? Our paper answers this in the negative. We show that importance weighting fails not because of the overparameterization, but instead, as a result of using exponentially-tailed losses like the logistic or cross-entropy loss. As a remedy, we show that polynomially-tailed losses restore the effects of importance reweighting in correcting distribution shift in overparameterized models. We characterize the behavior of gradient descent on importance weighted polynomially-tailed losses with overparameterized linear models, and theoretically demonstrate the advantage of using polynomially-tailed losses in a label shift setting. Surprisingly, our theory shows that using weights that are obtained by exponentiating the classical unbiased importance weights can improve performance. Finally, we demonstrate the practical value of our analysis with neural network experiments on a subpopulation shift and a label shift dataset. When reweighted, our loss function can outperform reweighted cross-entropy by as much as 9% in test accuracy. Our loss function also gives test accuracies comparable to, or even exceeding, well-tuned state-of-the-art methods for correcting distribution shifts.
    Semantic-Aware Collaborative Deep Reinforcement Learning Over Wireless Cellular Networks. (arXiv:2111.12064v2 [cs.IT] UPDATED)
    Collaborative deep reinforcement learning (CDRL) algorithms in which multiple agents can coordinate over a wireless network is a promising approach to enable future intelligent and autonomous systems that rely on real-time decision-making in complex dynamic environments. Nonetheless, in practical scenarios, CDRL faces many challenges due to the heterogeneity of agents and their learning tasks, different environments, time constraints of the learning, and resource limitations of wireless networks. To address these challenges, in this paper, a novel semantic-aware CDRL method is proposed to enable a group of heterogeneous untrained agents with semantically-linked DRL tasks to collaborate efficiently across a resource-constrained wireless cellular network. To this end, a new heterogeneous federated DRL (HFDRL) algorithm is proposed to select the best subset of semantically relevant DRL agents for collaboration. The proposed approach then jointly optimizes the training loss and wireless bandwidth allocation for the cooperating selected agents in order to train each agent within the time limit of its real-time task. Simulation results show the superior performance of the proposed algorithm compared to state-of-the-art baselines.
    Semantic Text-to-Face GAN -ST^2FG. (arXiv:2107.10756v2 [cs.CV] UPDATED)
    Faces generated using generative adversarial networks (GANs) have reached unprecedented realism. These faces, also known as "Deep Fakes", appear as realistic photographs with very little pixel-level distortions. While some work has enabled the training of models that lead to the generation of specific properties of the subject, generating a facial image based on a natural language description has not been fully explored. For security and criminal identification, the ability to provide a GAN-based system that works like a sketch artist would be incredibly useful. In this paper, we present a novel approach to generate facial images from semantic text descriptions. The learned model is provided with a text description and an outline of the type of face, which the model uses to sketch the features. Our models are trained using an Affine Combination Module (ACM) mechanism to combine the text embedding from BERT and the GAN latent space using a self-attention matrix. This avoids the loss of features due to inadequate "attention", which may happen if text embedding and latent vector are simply concatenated. Our approach is capable of generating images that are very accurately aligned to the exhaustive textual descriptions of faces with many fine detail features of the face and helps in generating better images. The proposed method is also capable of making incremental changes to a previously generated image if it is provided with additional textual descriptions or sentences.
    Do Vision Transformers See Like Convolutional Neural Networks?. (arXiv:2108.08810v2 [cs.CV] UPDATED)
    Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers. We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer.
    AutoMO-Mixer: An automated multi-objective Mixer model for balanced, safe and robust prediction in medicine. (arXiv:2203.02384v1 [eess.IV])
    Accurately identifying patient's status through medical images plays an important role in diagnosis and treatment. Artificial intelligence (AI), especially the deep learning, has achieved great success in many fields. However, more reliable AI model is needed in image guided diagnosis and therapy. To achieve this goal, developing a balanced, safe and robust model with a unified framework is desirable. In this study, a new unified model termed as automated multi-objective Mixer (AutoMO-Mixer) model was developed, which utilized a recent developed multiple layer perceptron Mixer (MLP-Mixer) as base. To build a balanced model, sensitivity and specificity were considered as the objective functions simultaneously in training stage. Meanwhile, a new evidential reasoning based on entropy was developed to achieve a safe and robust model in testing stage. The experiment on an optical coherence tomography dataset demonstrated that AutoMO-Mixer can obtain safer, more balanced, and robust results compared with MLP-Mixer and other available models.
    Evaluating Local Model-Agnostic Explanations of Learning to Rank Models with Decision Paths. (arXiv:2203.02295v1 [stat.ML])
    Local explanations of learning-to-rank (LTR) models are thought to extract the most important features that contribute to the ranking predicted by the LTR model for a single data point. Evaluating the accuracy of such explanations is challenging since the ground truth feature importance scores are not available for most modern LTR models. In this work, we propose a systematic evaluation technique for explanations of LTR models. Instead of using black-box models, such as neural networks, we propose to focus on tree-based LTR models, from which we can extract the ground truth feature importance scores using decision paths. Once extracted, we can directly compare the ground truth feature importance scores to the feature importance scores generated with explanation techniques. We compare two recently proposed explanation techniques for LTR models and benchmark them using decision trees and gradient boosting models on the MQ2008 dataset. We show that neither of the explanation techniques can achieve an acceptable explanation accuracy when the chosen similarity metric is AUC score or Spearman's rank correlation.
    Differentiable Causal Discovery Under Latent Interventions. (arXiv:2203.02336v1 [cs.LG])
    Recent work has shown promising results in causal discovery by leveraging interventional data with gradient-based methods, even when the intervened variables are unknown. However, previous work assumes that the correspondence between samples and interventions is known, which is often unrealistic. We envision a scenario with an extensive dataset sampled from multiple intervention distributions and one observation distribution, but where we do not know which distribution originated each sample and how the intervention affected the system, \textit{i.e.}, interventions are entirely latent. We propose a method based on neural networks and variational inference that addresses this scenario by framing it as learning a shared causal graph among an infinite mixture (under a Dirichlet process prior) of intervention structural causal models. Experiments with synthetic and real data show that our approach and its semi-supervised variant are able to discover causal relations in this challenging scenario.
    Neural Simulated Annealing. (arXiv:2203.02201v1 [cs.LG])
    Simulated annealing (SA) is a stochastic global optimisation technique applicable to a wide range of discrete and continuous variable problems. Despite its simplicity, the development of an effective SA optimiser for a given problem hinges on a handful of carefully handpicked components; namely, neighbour proposal distribution and temperature annealing schedule. In this work, we view SA from a reinforcement learning perspective and frame the proposal distribution as a policy, which can be optimised for higher solution quality given a fixed computational budget. We demonstrate that this Neural SA with such a learnt proposal distribution, parametrised by small equivariant neural networks, outperforms SA baselines on a number of problems: Rosenbrock's function, the Knapsack problem, the Bin Packing problem, and the Travelling Salesperson problem. We also show that Neural SA scales well to large problems - generalising to significantly larger problems than the ones seen during training - while achieving comparable performance to popular off-the-shelf solvers and other machine learning methods in terms of solution quality and wall-clock time.
    Adaptive Discounting of Implicit Language Models in RNN-Transducers. (arXiv:2203.02317v1 [cs.CL])
    RNN-Transducer (RNN-T) models have become synonymous with streaming end-to-end ASR systems. While they perform competitively on a number of evaluation categories, rare words pose a serious challenge to RNN-T models. One main reason for the degradation in performance on rare words is that the language model (LM) internal to RNN-Ts can become overconfident and lead to hallucinated predictions that are acoustically inconsistent with the underlying speech. To address this issue, we propose a lightweight adaptive LM discounting technique AdaptLMD, that can be used with any RNN-T architecture without requiring any external resources or additional parameters. AdaptLMD uses a two-pronged approach: 1) Randomly mask the prediction network output to encourage the RNN-T to not be overly reliant on it's outputs. 2) Dynamically choose when to discount the implicit LM (ILM) based on rarity of recently predicted tokens and divergence between ILM and implicit acoustic model (IAM) scores. Comparing AdaptLMD to a competitive RNN-T baseline, we obtain up to 4% and 14% relative reductions in overall WER and rare word PER, respectively, on a conversational, code-mixed Hindi-English ASR task.
    Exploring Scalable, Distributed Real-Time Anomaly Detection for Bridge Health Monitoring. (arXiv:2203.02380v1 [cs.NI])
    Modern real-time Structural Health Monitoring systems can generate a considerable amount of information that must be processed and evaluated for detecting early anomalies and generating prompt warnings and alarms about the civil infrastructure conditions. The current cloud-based solutions cannot scale if the raw data has to be collected from thousands of buildings. This paper presents a full-stack deployment of an efficient and scalable anomaly detection pipeline for SHM systems which does not require sending raw data to the cloud but relies on edge computation. First, we benchmark three algorithmic approaches of anomaly detection, i.e., Principal Component Analysis (PCA), Fully-Connected AutoEncoder (FC-AE), and Convolutional AutoEncoder (C-AE). Then, we deploy them on an edge-sensor, the STM32L4, with limited computing capabilities. Our approach decreases network traffic by $\approx8\cdot10^5\times$ , from 780KB/hour to less than 10 Bytes/hour for a single installation and minimize network and cloud resource utilization, enabling the scaling of the monitoring infrastructure. A real-life case study, a highway bridge in Italy, demonstrates that combining near-sensor computation of anomaly detection algorithms, smart pre-processing, and low-power wide-area network protocols (LPWAN) we can greatly reduce data communication and cloud computing costs, while anomaly detection accuracy is not adversely affected.
    Private Identity Testing for High-Dimensional Distributions. (arXiv:1905.11947v3 [cs.DS] UPDATED)
    In this work we present novel differentially private identity (goodness-of-fit) testers for natural and widely studied classes of multivariate product distributions: Gaussians in $\mathbb{R}^d$ with known covariance and product distributions over $\{\pm 1\}^{d}$. Our testers have improved sample complexity compared to those derived from previous techniques, and are the first testers whose sample complexity matches the order-optimal minimax sample complexity of $O(d^{1/2}/\alpha^2)$ in many parameter regimes. We construct two types of testers, exhibiting tradeoffs between sample complexity and computational complexity. Finally, we provide a two-way reduction between testing a subclass of multivariate product distributions and testing univariate distributions, and thereby obtain upper and lower bounds for testing this subclass of product distributions.
    Embedding Comparator: Visualizing Differences in Global Structure and Local Neighborhoods via Small Multiples. (arXiv:1912.04853v3 [cs.HC] UPDATED)
    Embeddings mapping high-dimensional discrete input to lower-dimensional continuous vector spaces have been widely adopted in machine learning applications as a way to capture domain semantics. Interviewing 13 embedding users across disciplines, we find comparing embeddings is a key task for deployment or downstream analysis but unfolds in a tedious fashion that poorly supports systematic exploration. In response, we present the Embedding Comparator, an interactive system that presents a global comparison of embedding spaces alongside fine-grained inspection of local neighborhoods. It systematically surfaces points of comparison by computing the similarity of the $k$-nearest neighbors of every embedded object between a pair of spaces. Through case studies across multiple modalities, we demonstrate our system rapidly reveals insights, such as semantic changes following fine-tuning, language changes over time, and differences between seemingly similar models. In evaluations with 15 participants, we find our system accelerates comparisons by shifting from laborious manual specification to browsing and manipulating visualizations.
    Contrastive Graph Convolutional Networks for Hardware Trojan Detection in Third Party IP Cores. (arXiv:2203.02095v1 [cs.LG])
    The availability of wide-ranging third-party intellectual property (3PIP) cores enables integrated circuit (IC) designers to focus on designing high-level features in ASICs/SoCs. The massive proliferation of ICs brings with it an increased number of bad actors seeking to exploit those circuits for various nefarious reasons. This is not surprising as integrated circuits affect every aspect of society. Thus, malicious logic (Hardware Trojans, HT) being surreptitiously injected by untrusted vendors into 3PIP cores used in IC design is an ever present threat. In this paper, we explore methods for identification of trigger-based HT in designs containing synthesizable IP cores without a golden model. Specifically, we develop methods to detect hardware trojans by detecting triggers embedded in ICs purely based on netlists acquired from the vendor. We propose GATE-Net, a deep learning model based on graph-convolutional networks (GCN) trained using supervised contrastive learning, for flagging designs containing randomly-inserted triggers using only the corresponding netlist. Our proposed architecture achieves significant improvements over state-of-the-art learning models yielding an average 46.99% improvement in detection performance for combinatorial triggers and 21.91% improvement for sequential triggers across a variety of circuit types. Through rigorous experimentation, qualitative and quantitative performance evaluations, we demonstrate effectiveness of GATE-Net and the supervised contrastive training of GATE-Net for HT detection.  ( 2 min )
    A photonic chip-based machine learning approach for the prediction of molecular properties. (arXiv:2203.02285v1 [cs.ET])
    Machine learning methods have revolutionized the discovery process of new molecules and materials. However, the intensive training process of neural networks for molecules with ever increasing complexity has resulted in exponential growth in computation cost, leading to long simulation time and high energy consumption. Photonic chip technology offers an alternative platform for implementing neural network with faster data processing and lower energy usage compared to digital computers. Here, we demonstrate the capability of photonic neural networks in predicting the quantum mechanical properties of molecules. Additionally, we show that multiple properties can be learned simultaneously in a photonic chip via a multi-task regression learning algorithm, which we believe is the first of its kind, as most previous works focus on implementing a network for the task of classification. Photonics technology are also naturally capable of implementing complex-valued neural networks at no additional hardware cost and we show that such neural networks outperform conventional real-valued networks for molecular property prediction. Our work opens the avenue for harnessing photonic technology for large-scale machine learning applications in molecular sciences such as drug discovery and materials design.  ( 2 min )
    Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. (arXiv:2203.02053v1 [cs.CL])
    We present modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g. images and text) are embedded at arm's length in their shared representation in multi-modal models such as CLIP. Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. In model initialization, we show empirically and theoretically that the representation of a common deep neural network is restricted to a narrow cone. As a consequence, in a multi-modal model with two encoders, the representations of the two modalities are clearly apart when the model is initialized. During optimization, contrastive learning keeps the different modalities separate by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness. Our code and data are available at https://modalitygap.readthedocs.io/  ( 2 min )
    Analysis of closed-loop inertial gradient dynamics. (arXiv:2203.02140v1 [math.OC])
    In this paper, we analyse the performance of the closed-loop Whiplash gradient descent algorithm for $L$-smooth convex cost functions. Using numerical experiments, we study the algorithm's performance for convex cost functions, for different condition numbers. We analyse the convergence of the momentum sequence using symplectic integration and introduce the concept of relaxation sequences which analyses the non-classical character of the whiplash method. Under the additional assumption of invexity, we establish a momentum-driven adaptive convergence rate. Furthermore, we introduce an energy method for predicting the convergence rate with convex cost functions for closed-loop inertial gradient dynamics, using an integral anchored energy function and a novel lower bound asymptotic notation, by exploiting the bounded nature of the solutions. Using this, we establish a polynomial convergence rate for the whiplash inertial gradient system, for a family of scalar quadratic cost functions and an exponential rate for a quadratic scalar cost function.  ( 2 min )
    Quality or Quantity: Toward a Unified Approach for Multi-organ Segmentation in Body CT. (arXiv:2203.01934v1 [eess.IV])
    Organ segmentation of medical images is a key step in virtual imaging trials. However, organ segmentation datasets are limited in terms of quality (because labels cover only a few organs) and quantity (since case numbers are limited). In this study, we explored the tradeoffs between quality and quantity. Our goal is to create a unified approach for multi-organ segmentation of body CT, which will facilitate the creation of large numbers of accurate virtual phantoms. Initially, we compared two segmentation architectures, 3D-Unet and DenseVNet, which were trained using XCAT data that is fully labeled with 22 organs, and chose the 3D-Unet as the better performing model. We used the XCAT-trained model to generate pseudo-labels for the CT-ORG dataset that has only 7 organs segmented. We performed two experiments: First, we trained 3D-UNet model on the XCAT dataset, representing quality data, and tested it on both XCAT and CT-ORG datasets. Second, we trained 3D-UNet after including the CT-ORG dataset into the training set to have more quantity. Performance improved for segmentation in the organs where we have true labels in both datasets and degraded when relying on pseudo-labels. When organs were labeled in both datasets, Exp-2 improved Average DSC in XCAT and CT-ORG by 1. This demonstrates that quality data is the key to improving the model's performance.  ( 2 min )
    Adversarial Patterns: Building Robust Android Malware Classifiers. (arXiv:2203.02121v1 [cs.CR])
    Deep learning-based classifiers have substantially improved recognition of malware samples. However, these classifiers can be vulnerable to adversarial input perturbations. Any vulnerability in malware classifiers poses significant threats to the platforms they defend. Therefore, to create stronger defense models against malware, we must understand the patterns in input perturbations caused by an adversary. This survey paper presents a comprehensive study on adversarial machine learning for android malware classifiers. We first present an extensive background in building a machine learning classifier for android malware, covering both image-based and text-based feature extraction approaches. Then, we examine the pattern and advancements in the state-of-the-art research in evasion attacks and defenses. Finally, we present guidelines for designing robust malware classifiers and enlist research directions for the future.  ( 2 min )
    Rethinking Reconstruction Autoencoder-Based Out-of-Distribution Detection. (arXiv:2203.02194v1 [cs.CV])
    In some scenarios, classifier requires detecting out-of-distribution samples far from its training data. With desirable characteristics, reconstruction autoencoder-based methods deal with this problem by using input reconstruction error as a metric of novelty vs. normality. We formulate the essence of such approach as a quadruplet domain translation with an intrinsic bias to only query for a proxy of conditional data uncertainty. Accordingly, an improvement direction is formalized as maximumly compressing the autoencoder's latent space while ensuring its reconstructive power for acting as a described domain translator. From it, strategies are introduced including semantic reconstruction, data certainty decomposition and normalized L2 distance to substantially improve original methods, which together establish state-of-the-art performance on various benchmarks, e.g., the FPR@95%TPR of CIFAR-100 vs. TinyImagenet-crop on Wide-ResNet is 0.2%. Importantly, our method works without any additional data, hard-to-implement structure, time-consuming pipeline, and even harming the classification accuracy of known classes.  ( 2 min )
    Region-of-Interest Based Neural Video Compression. (arXiv:2203.01978v1 [eess.IV])
    Humans do not perceive all parts of a scene with the same resolution, but rather focus on few regions of interest (ROIs). Traditional Object-Based codecs take advantage of this biological intuition, and are capable of non-uniform allocation of bits in favor of salient regions, at the expense of increased distortion the remaining areas: such a strategy allows a boost in perceptual quality under low rate constraints. Recently, several neural codecs have been introduced for video compression, yet they operate uniformly over all spatial locations, lacking the capability of ROI-based processing. In this paper, we introduce two models for ROI-based neural video coding. First, we propose an implicit model that is fed with a binary ROI mask and it is trained by de-emphasizing the distortion of the background. Secondly, we design an explicit latent scaling method, that allows control over the quantization binwidth for different spatial regions of latent variables, conditioned on the ROI mask. By extensive experiments, we show that our methods outperform all our baselines in terms of Rate-Distortion (R-D) performance in the ROI. Moreover, they can generalize to different datasets and to any arbitrary ROI at inference time. Finally, they do not require expensive pixel-level annotations during training, as synthetic ROI masks can be used with little to no degradation in performance. To the best of our knowledge, our proposals are the first solutions that integrate ROI-based capabilities into neural video compression models.  ( 2 min )
    Data-Efficient and Interpretable Tabular Anomaly Detection. (arXiv:2203.02034v1 [cs.LG])
    Anomaly detection (AD) plays an important role in numerous applications. We focus on two understudied aspects of AD that are critical for integration into real-world applications. First, most AD methods cannot incorporate labeled data that are often available in practice in small quantities and can be crucial to achieve high AD accuracy. Second, most AD methods are not interpretable, a bottleneck that prevents stakeholders from understanding the reason behind the anomalies. In this paper, we propose a novel AD framework that adapts a white-box model class, Generalized Additive Models, to detect anomalies using a partial identification objective which naturally handles noisy or heterogeneous features. In addition, the proposed framework, DIAD, can incorporate a small amount of labeled data to further boost anomaly detection performances in semi-supervised settings. We demonstrate the superiority of our framework compared to previous work in both unsupervised and semi-supervised settings using diverse tabular datasets. For example, under 5 labeled anomalies DIAD improves from 86.2\% to 89.4\% AUC by learning AD from unlabeled data. We also present insightful interpretations that explain why DIAD deems certain samples as anomalies.  ( 2 min )
    Safety-aware metrics for object detectors in autonomous driving. (arXiv:2203.02205v1 [cs.LG])
    We argue that object detectors in the safety critical domain should prioritize detection of objects that are most likely to interfere with the actions of the autonomous actor. Especially, this applies to objects that can impact the actor's safety and reliability. In the context of autonomous driving, we propose new object detection metrics that reward the correct identification of objects that are most likely to interact with the subject vehicle (i.e., the actor), and that may affect its driving decision. To achieve this, we build a criticality model to reward the detection of the objects based on proximity, orientation, and relative velocity with respect to the subject vehicle. Then, we apply our model on the recent autonomous driving dataset nuScenes, and we compare eight different object detectors. Results show that, in several settings, object detectors that perform best according to the nuScenes ranking are not the preferable ones when the focus is shifted on safety and reliability.  ( 2 min )
    Learning Incrementally to Segment Multiple Organs in a CT Image. (arXiv:2203.02100v1 [eess.IV])
    There exists a large number of datasets for organ segmentation, which are partially annotated and sequentially constructed. A typical dataset is constructed at a certain time by curating medical images and annotating the organs of interest. In other words, new datasets with annotations of new organ categories are built over time. To unleash the potential behind these partially labeled, sequentially-constructed datasets, we propose to incrementally learn a multi-organ segmentation model. In each incremental learning (IL) stage, we lose the access to previous data and annotations, whose knowledge is assumingly captured by the current model, and gain the access to a new dataset with annotations of new organ categories, from which we learn to update the organ segmentation model to include the new organs. While IL is notorious for its `catastrophic forgetting' weakness in the context of natural image analysis, we experimentally discover that such a weakness mostly disappears for CT multi-organ segmentation. To further stabilize the model performance across the IL stages, we introduce a light memory module and some loss functions to restrain the representation of different categories in feature space, aggregating feature representation of the same class and separating feature representation of different classes. Extensive experiments on five open-sourced datasets are conducted to illustrate the effectiveness of our method.  ( 2 min )
    X2T: Training an X-to-Text Typing Interface with Online Learning from User Feedback. (arXiv:2203.02072v1 [cs.HC])
    We aim to help users communicate their intent to machines using flexible, adaptive interfaces that translate arbitrary user input into desired actions. In this work, we focus on assistive typing applications in which a user cannot operate a keyboard, but can instead supply other inputs, such as webcam images that capture eye gaze or neural activity measured by a brain implant. Standard methods train a model on a fixed dataset of user inputs, then deploy a static interface that does not learn from its mistakes; in part, because extracting an error signal from user behavior can be challenging. We investigate a simple idea that would enable such interfaces to improve over time, with minimal additional effort from the user: online learning from user feedback on the accuracy of the interface's actions. In the typing domain, we leverage backspaces as feedback that the interface did not perform the desired action. We propose an algorithm called x-to-text (X2T) that trains a predictive model of this feedback signal, and uses this model to fine-tune any existing, default interface for translating user input into actions that select words or characters. We evaluate X2T through a small-scale online user study with 12 participants who type sentences by gazing at their desired words, a large-scale observational study on handwriting samples from 60 users, and a pilot study with one participant using an electrocorticography-based brain-computer interface. The results show that X2T learns to outperform a non-adaptive default interface, stimulates user co-adaptation to the interface, personalizes the interface to individual users, and can leverage offline data collected from the default interface to improve its initial performance and accelerate online learning.  ( 2 min )
    Cloud-Edge Training Architecture for Sim-to-Real Deep Reinforcement Learning. (arXiv:2203.02230v1 [cs.LG])
    Deep reinforcement learning (DRL) is a promising approach to solve complex control tasks by learning policies through interactions with the environment. However, the training of DRL policies requires large amounts of training experiences, making it impractical to learn the policy directly on physical systems. Sim-to-real approaches leverage simulations to pretrain DRL policies and then deploy them in the real world. Unfortunately, the direct real-world deployment of pretrained policies usually suffers from performance deterioration due to the different dynamics, known as the reality gap. Recent sim-to-real methods, such as domain randomization and domain adaptation, focus on improving the robustness of the pretrained agents. Nevertheless, the simulation-trained policies often need to be tuned with real-world data to reach optimal performance, which is challenging due to the high cost of real-world samples. This work proposes a distributed cloud-edge architecture to train DRL agents in the real world in real-time. In the architecture, the inference and training are assigned to the edge and cloud, separating the real-time control loop from the computationally expensive training loop. To overcome the reality gap, our architecture exploits sim-to-real transfer strategies to continue the training of simulation-pretrained agents on a physical system. We demonstrate its applicability on a physical inverted-pendulum control system, analyzing critical parameters. The real-world experiments show that our architecture can adapt the pretrained DRL agents to unseen dynamics consistently and efficiently.  ( 2 min )
    Training language models to follow instructions with human feedback. (arXiv:2203.02155v1 [cs.CL])
    Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.  ( 2 min )
    An Efficient Subpopulation-based Membership Inference Attack. (arXiv:2203.02080v1 [cs.LG])
    Membership inference attacks allow a malicious entity to predict whether a sample is used during training of a victim model or not. State-of-the-art membership inference attacks have shown to achieve good accuracy which poses a great privacy threat. However, majority of SOTA attacks require training dozens to hundreds of shadow models to accurately infer membership. This huge computation cost raises questions about practicality of these attacks on deep models. In this paper, we introduce a fundamentally different MI attack approach which obviates the need to train hundreds of shadow models. Simply put, we compare the victim model output on the target sample versus the samples from the same subpopulation (i.e., semantically similar samples), instead of comparing it with the output of hundreds of shadow models. The intuition is that the model response should not be significantly different between the target sample and its subpopulation if it was not a training sample. In cases where subpopulation samples are not available to the attacker, we show that training only a single generative model can fulfill the requirement. Hence, we achieve the state-of-the-art membership inference accuracy while significantly reducing the training computation cost.  ( 2 min )
    Symmetry Structured Convolutional Neural Networks. (arXiv:2203.02056v1 [stat.ML])
    We consider Convolutional Neural Networks (CNNs) with 2D structured features that are symmetric in the spatial dimensions. Such networks arise in modeling pairwise relationships for a sequential recommendation problem, as well as secondary structure inference problems of RNA and protein sequences. We develop a CNN architecture that generates and preserves the symmetry structure in the network's convolutional layers. We present parameterizations for the convolutional kernels that produce update rules to maintain symmetry throughout the training. We apply this architecture to the sequential recommendation problem, the RNA secondary structure inference problem, and the protein contact map prediction problem, showing that the symmetric structured networks produce improved results using fewer numbers of machine parameters.  ( 2 min )
    GCNet: Graph Completion Network for Incomplete Multimodal Learning in Conversation. (arXiv:2203.02177v1 [cs.LG])
    Conversations have become a critical data format on social media platforms. Understanding conversation from emotion, content, and other aspects also attracts increasing attention from researchers due to its widespread application in human-computer interaction. In real-world environments, we often encounter the problem of incomplete modalities, which has become a core issue of conversation understanding. To address this problem, researchers propose various methods. However, existing approaches are mainly designed for individual utterances or medical images rather than conversational data, which cannot exploit temporal and speaker information in conversations. To this end, we propose a novel framework for incomplete multimodal learning in conversations, called "Graph Complete Network (GCNet)", filling the gap of existing works. Our GCNet contains two well-designed graph neural network-based modules, "Speaker GNN" and "Temporal GNN", to capture temporal and speaker information in conversations. To make full use of complete and incomplete data in feature learning, we jointly optimize classification and reconstruction in an end-to-end manner. To verify the effectiveness of our method, we conduct experiments on three benchmark conversational datasets. Experimental results demonstrate that our GCNet is superior to existing state-of-the-art approaches in incomplete multimodal learning.  ( 2 min )
    Differentially Private Label Protection in Split Learning. (arXiv:2203.02073v1 [cs.LG])
    Split learning is a distributed training framework that allows multiple parties to jointly train a machine learning model over vertically partitioned data (partitioned by attributes). The idea is that only intermediate computation results, rather than private features and labels, are shared between parties so that raw training data remains private. Nevertheless, recent works showed that the plaintext implementation of split learning suffers from severe privacy risks that a semi-honest adversary can easily reconstruct labels. In this work, we propose \textsf{TPSL} (Transcript Private Split Learning), a generic gradient perturbation based split learning framework that provides provable differential privacy guarantee. Differential privacy is enforced on not only the model weights, but also the communicated messages in the distributed computation setting. Our experiments on large-scale real-world datasets demonstrate the robustness and effectiveness of \textsf{TPSL} against label leakage attacks. We also find that \textsf{TPSL} have a better utility-privacy trade-off than baselines.  ( 2 min )
    Continual Horizontal Federated Learning for Heterogeneous Data. (arXiv:2203.02108v1 [cs.LG])
    Federated learning is a promising machine learning technique that enables multiple clients to collaboratively build a model without revealing the raw data to each other. Among various types of federated learning methods, horizontal federated learning (HFL) is the best-studied category and handles homogeneous feature spaces. However, in the case of heterogeneous feature spaces, HFL uses only common features and leaves client-specific features unutilized. In this paper, we propose a HFL method using neural networks named continual horizontal federated learning (CHFL), a continual learning approach to improve the performance of HFL by taking advantage of unique features of each client. CHFL splits the network into two columns corresponding to common features and unique features, respectively. It jointly trains the first column by using common features through vanilla HFL and locally trains the second column by using unique features and leveraging the knowledge of the first one via lateral connections without interfering with the federated training of it. We conduct experiments on various real world datasets and show that CHFL greatly outperforms vanilla HFL that only uses common features and local learning that uses all features that each client has.  ( 2 min )
    Continual Few-shot Relation Learning via Embedding Space Regularization and Data Augmentation. (arXiv:2203.02135v1 [cs.CL])
    Existing continual relation learning (CRL) methods rely on plenty of labeled training data for learning a new task, which can be hard to acquire in real scenario as getting large and representative labeled data is often expensive and time-consuming. It is therefore necessary for the model to learn novel relational patterns with very few labeled data while avoiding catastrophic forgetting of previous task knowledge. In this paper, we formulate this challenging yet practical problem as continual few-shot relation learning (CFRL). Based on the finding that learning for new emerging few-shot tasks often results in feature distributions that are incompatible with previous tasks' learned distributions, we propose a novel method based on embedding space regularization and data augmentation. Our method generalizes to new few-shot tasks and avoids catastrophic forgetting of previous tasks by enforcing extra constraints on the relational embeddings and by adding extra {relevant} data in a self-supervised manner. With extensive experiments we demonstrate that our method can significantly outperform previous state-of-the-art methods in CFRL task settings.  ( 2 min )
    Improving \textit{Tug-of-War} sketch using Control-Variates method. (arXiv:2203.02432v1 [cs.LG])
    Computing space-efficient summary, or \textit{a.k.a. sketches}, of large data, is a central problem in the streaming algorithm. Such sketches are used to answer \textit{post-hoc} queries in several data analytics tasks. The algorithm for computing sketches typically requires to be fast, accurate, and space-efficient. A fundamental problem in the streaming algorithm framework is that of computing the frequency moments of data streams. The frequency moments of a sequence containing $f_i$ elements of type $i$, are the numbers $\mathbf{F}_k=\sum_{i=1}^n {f_i}^k,$ where $i\in [n]$. This is also called as $\ell_k$ norm of the frequency vector $(f_1, f_2, \ldots f_n).$ Another important problem is to compute the similarity between two data streams by computing the inner product of the corresponding frequency vectors. The seminal work of Alon, Matias, and Szegedy~\cite{AMS}, \textit{a.k.a. Tug-of-war} (or AMS) sketch gives a randomized sublinear space (and linear time) algorithm for computing the frequency moments, and the inner product between two frequency vectors corresponding to the data streams. However, the variance of these estimates typically tends to be large. In this work, we focus on minimizing the variance of these estimates. We use the techniques from the classical Control-Variate method~\cite{Lavenberg} which is primarily known for variance reduction in Monte-Carlo simulations, and as a result, we are able to obtain significant variance reduction, at the cost of a little computational overhead. We present a theoretical analysis of our proposal and complement it with supporting experiments on synthetic as well as real-world datasets.
    Better Supervisory Signals by Observing Learning Paths. (arXiv:2203.02485v1 [stat.ML])
    Better-supervised models might have better performance. In this paper, we first clarify what makes for good supervision for a classification problem, and then explain two existing label refining methods, label smoothing and knowledge distillation, in terms of our proposed criterion. To further answer why and how better supervision emerges, we observe the learning path, i.e., the trajectory of the model's predictions during training, for each training sample. We find that the model can spontaneously refine "bad" labels through a "zig-zag" learning path, which occurs on both toy and real datasets. Observing the learning path not only provides a new perspective for understanding knowledge distillation, overfitting, and learning dynamics, but also reveals that the supervisory signal of a teacher network can be very unstable near the best points in training on real tasks. Inspired by this, we propose a new knowledge distillation scheme, Filter-KD, which improves downstream classification performance in various settings.
    Nonlinear predictive models computation in ADPCM schemes. (arXiv:2203.02020v1 [cs.SD])
    Recently several papers have been published on nonlinear prediction applied to speech coding. At ICASSP98 we presented a system based on an ADPCM scheme with a nonlinear predictor based on a neural net. The most critical parameter was the training procedure in order to achieve good generalization capability and robustness against mismatch between training and testing conditions. In this paper, we propose several new approaches that improve the performance of the original system in up to 1.2dB of SEGSNR (using bayesian regularization). The variance of the SEGSNR between frames is also minimized, so the new scheme produces a more stable quality of the output.  ( 2 min )
    WPNAS: Neural Architecture Search by jointly using Weight Sharing and Predictor. (arXiv:2203.02086v1 [cs.LG])
    Weight sharing based and predictor based methods are two major types of fast neural architecture search methods. In this paper, we propose to jointly use weight sharing and predictor in a unified framework. First, we construct a SuperNet in a weight-sharing way and probabilisticly sample architectures from the SuperNet. To increase the correctness of the evaluation of architectures, besides direct evaluation using the inherited weights, we further apply a few-shot predictor to assess the architecture on the other hand. The final evaluation of the architecture is the combination of direct evaluation, the prediction from the predictor and the cost of the architecture. We regard the evaluation as a reward and apply a self-critical policy gradient approach to update the architecture probabilities. To further reduce the side effects of weight sharing, we propose a weakly weight sharing method by introducing another HyperNet. We conduct experiments on datasets including CIFAR-10, CIFAR-100 and ImageNet under NATS-Bench, DARTS and MobileNet search space. The proposed WPNAS method achieves state-of-the-art performance on these datasets.  ( 2 min )
    Uncertainty Estimation for Heatmap-based Landmark Localization. (arXiv:2203.02351v1 [cs.LG])
    Automatic anatomical landmark localization has made great strides by leveraging deep learning methods in recent years. The ability to quantify the uncertainty of these predictions is a vital ingredient needed to see these methods adopted in clinical use, where it is imperative that erroneous predictions are caught and corrected. We propose Quantile Binning, a data-driven method to categorise predictions by uncertainty with estimated error bounds. This framework can be applied to any continuous uncertainty measure, allowing straightforward identification of the best subset of predictions with accompanying estimated error bounds. We facilitate easy comparison between uncertainty measures by constructing two evaluation metrics derived from Quantile Binning. We demonstrate this framework by comparing and contrasting three uncertainty measures (a baseline, the current gold standard, and a proposed method combining aspects of the two), across two datasets (one easy, one hard) and two heatmap-based landmark localization model paradigms (U-Net and patch-based). We conclude by illustrating how filtering out gross mispredictions caught in our Quantile Bins significantly improves the proportion of predictions under an acceptable error threshold, and offer recommendations on which uncertainty measure to use and how to use it.
    Zero-shot Domain Adaptation of Heterogeneous Graphs via Knowledge Transfer Networks. (arXiv:2203.02018v1 [cs.LG])
    How can we make predictions for nodes in a heterogeneous graph when an entire type of node (e.g. user) has no labels (perhaps due to privacy issues) at all? Although heterogeneous graph neural networks (HGNNs) have shown superior performance as powerful representation learning techniques, there is no direct way to learn using labels rooted at different node types. Domain adaptation (DA) targets this setting, however, existing DA can not be applied directly to HGNNs. In heterogeneous graphs, the source and target domains have different modalities, thus HGNNs provide different feature extractors to them, while most of DA assumes source and target domains share a common feature extractor. In this work, we address the issue of zero-shot domain adaptation in HGNNs. We first theoretically induce a relationship between source and target domain features extracted from HGNNs, then propose a novel domain adaptation method, Knowledge Transfer Networks for HGNNs (HGNN-KTN). HGNN-KTN learns the relationship between source and target features, then maps the target distributions into the source domain. HGNN-KTN outperforms state-of-the-art baselines, showing up to 73.3% higher in MRR on 18 different domain adaptation tasks running on real-world benchmark graphs.  ( 2 min )
    Integrating Statistical Uncertainty into Neural Network-Based Speech Enhancement. (arXiv:2203.02288v1 [eess.AS])
    Speech enhancement in the time-frequency domain is often performed by estimating a multiplicative mask to extract clean speech. However, most neural network-based methods perform point estimation, i.e., their output consists of a single mask. In this paper, we study the benefits of modeling uncertainty in neural network-based speech enhancement. For this, our neural network is trained to map a noisy spectrogram to the Wiener filter and its associated variance, which quantifies uncertainty, based on the maximum a posteriori (MAP) inference of spectral coefficients. By estimating the distribution instead of the point estimate, one can model the uncertainty associated with each estimate. We further propose to use the estimated Wiener filter and its uncertainty to build an approximate MAP (A-MAP) estimator of spectral magnitudes, which in turn is combined with the MAP inference of spectral coefficients to form a hybrid loss function to jointly reinforce the estimation. Experimental results on different datasets show that the proposed method can not only capture the uncertainty associated with the estimated filters, but also yield a higher enhancement performance over comparable models that do not take uncertainty into account.
    Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization. (arXiv:2203.02214v1 [cs.LG])
    Recent progress in state-only imitation learning extends the scope of applicability of imitation learning to real-world settings by relieving the need for observing expert actions. However, existing solutions only learn to extract a state-to-action mapping policy from the data, without considering how the expert plans to the target. This hinders the ability to leverage demonstrations and limits the flexibility of the policy. In this paper, we introduce Decoupled Policy Optimization (DePO), which explicitly decouples the policy as a high-level state planner and an inverse dynamics model. With embedded decoupled policy gradient and generative adversarial training, DePO enables knowledge transfer to different action spaces or state transition dynamics, and can generalize the planner to out-of-demonstration state regions. Our in-depth experimental analysis shows the effectiveness of DePO on learning a generalized target state planner while achieving the best imitation performance. We demonstrate the appealing usage of DePO for transferring across different tasks by pre-training, and the potential for co-training agents with various skills.
    Carbon Footprint of Selecting and Training Deep Learning Models for Medical Image Analysis. (arXiv:2203.02202v1 [eess.IV])
    The increasing energy consumption and carbon footprint of deep learning (DL) due to growing compute requirements has become a cause of concern. In this work, we focus on the carbon footprint of developing DL models for medical image analysis (MIA), where volumetric images of high spatial resolution are handled. In this study, we present and compare the features of four tools from literature to quantify the carbon footprint of DL. Using one of these tools we estimate the carbon footprint of medical image segmentation pipelines. We choose nnU-net as the proxy for a medical image segmentation pipeline and experiment on three common datasets. With our work we hope to inform on the increasing energy costs incurred by MIA. We discuss simple strategies to cut-down the environmental impact that can make model selection and training processes more efficient.  ( 2 min )
    Matrix Completion via Non-Convex Relaxation and Adaptive Correlation Learning. (arXiv:2203.02189v1 [cs.LG])
    The existing matrix completion methods focus on optimizing the relaxation of rank function such as nuclear norm, Schatten-p norm, etc. They usually need many iterations to converge. Moreover, only the low-rank property of matrices is utilized in most existing models and several methods that incorporate other knowledge are quite time-consuming in practice. To address these issues, we propose a novel non-convex surrogate that can be optimized by closed-form solutions, such that it empirically converges within dozens of iterations. Besides, the optimization is parameter-free and the convergence is proved. Compared with the relaxation of rank, the surrogate is motivated by optimizing an upper-bound of rank. We theoretically validate that it is equivalent to the existing matrix completion models. Besides the low-rank assumption, we intend to exploit the column-wise correlation for matrix completion, and thus an adaptive correlation learning, which is scaling-invariant, is developed. More importantly, after incorporating the correlation learning, the model can be still solved by closed-form solutions such that it still converges fast. Experiments show the effectiveness of the non-convex surrogate and adaptive correlation learning.  ( 2 min )
    Sharper Bounds for Proximal Gradient Algorithms with Errors. (arXiv:2203.02204v1 [math.OC])
    We analyse the convergence of the proximal gradient algorithm for convex composite problems in the presence of gradient and proximal computational inaccuracies. We derive new tighter deterministic and probabilistic bounds that we use to verify a simulated (MPC) and a synthetic (LASSO) optimization problems solved on a reduced-precision machine in combination with an inaccurate proximal operator. We also show how the probabilistic bounds are more robust for algorithm verification and more accurate for application performance guarantees. Under some statistical assumptions, we also prove that some cumulative error terms follow a martingale property. And conforming to observations, e.g., in \cite{schmidt2011convergence}, we also show how the acceleration of the algorithm amplifies the gradient and proximal computational errors.  ( 2 min )
    LiteTransformerSearch: Training-free On-device Search for Efficient Autoregressive Language Models. (arXiv:2203.02094v1 [cs.LG])
    The transformer architecture is ubiquitously used as the building block of most large-scale language models. However, it remains a painstaking guessing game of trial and error to set its myriad of architectural hyperparameters, e.g., number of layers, number of attention heads, and inner size of the feed forward network, and find architectures with the optimal trade-off between task performance like perplexity and compute constraints like memory and latency. This challenge is further exacerbated by the proliferation of various hardware. In this work, we leverage the somewhat surprising empirical observation that the number of non-embedding parameters in autoregressive transformers has a high rank correlation with task performance, irrespective of the architectural hyperparameters. Since architectural hyperparameters affect the latency and memory footprint in a hardware-dependent manner, the above observation organically induces a simple search algorithm that can be directly run on target devices. We rigorously show that the latency and perplexity pareto-frontier can be found without need for any model training, using non-embedding parameters as a proxy for perplexity. We evaluate our method, dubbed Lightweight Transformer Search (LTS), on diverse devices from ARM CPUs to Nvidia GPUs and show that the perplexity of Transformer-XL can be achieved with up to 2x lower latency. LTS extracts the pareto-frontier in less than 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of training for hundreds of GPU hours, offering a strong simple baseline for future NAS methods in autoregressive language modeling.  ( 2 min )
    Distributionally Robust Bayesian Optimization with $\phi$-divergences. (arXiv:2203.02128v1 [cs.LG])
    The study of robustness has received much attention due to its inevitability in data-driven settings where many systems face uncertainty. One such example of concern is Bayesian Optimization (BO), where uncertainty is multi-faceted, yet there only exists a limited number of works dedicated to this direction. In particular, there is the work of Kirschner et al. (2020), which bridges the existing literature of Distributionally Robust Optimization (DRO) by casting the BO problem from the lens of DRO. While this work is pioneering, it admittedly suffers from various practical shortcomings such as finite contexts assumptions, leaving behind the main question Can one devise a computationally tractable algorithm for solving this DRO-BO problem? In this work, we tackle this question to a large degree of generality by considering robustness against data-shift in $\phi$-divergences, which subsumes many popular choices, such as the $\chi^2$-divergence, Total Variation, and the extant Kullback-Leibler (KL) divergence. We show that the DRO-BO problem in this setting is equivalent to a finite-dimensional optimization problem which, even in the continuous context setting, can be easily implemented with provable sublinear regret bounds. We then show experimentally that our method surpasses existing methods, attesting to the theoretical results  ( 2 min )
    FairPrune: Achieving Fairness Through Pruning for Dermatological Disease Diagnosis. (arXiv:2203.02110v1 [eess.IV])
    Many works have shown that deep learning-based medical image classification models can exhibit bias toward certain demographic attributes like race, gender, and age. Existing bias mitigation methods primarily focus on learning debiased models, which may not necessarily guarantee all sensitive information can be removed and usually comes with considerable accuracy degradation on both privileged and unprivileged groups. To tackle this issue, we propose a method, FairPrune, that achieves fairness by pruning. Conventionally, pruning is used to reduce the model size for efficient inference. However, we show that pruning can also be a powerful tool to achieve fairness. Our observation is that during pruning, each parameter in the model has different importance for different groups' accuracy. By pruning the parameters based on this importance difference, we can reduce the accuracy difference between the privileged group and the unprivileged group to improve fairness without a large accuracy drop. To this end, we use the second derivative of the parameters of a pre-trained model to quantify the importance of each parameter with respect to the model accuracy for each group. Experiments on two skin lesion diagnosis datasets over multiple sensitive attributes demonstrate that our method can greatly improve fairness while keeping the average accuracy of both groups as high as possible.  ( 2 min )
    MF-Hovernet: An Extension of Hovernet for Colon Nuclei Identification and Counting (CoNiC) Challenge. (arXiv:2203.02161v1 [eess.IV])
    Nuclei Identification and Counting is the most important morphological feature of cancers, especially in the colon. Many deep learning-based methods have been proposed to deal with this problem. In this work, we construct an extension of Hovernet for nuclei identification and counting to address the problem named MF-Hovernet. Our proposed model is the combination of multiple filer block to Hovernet architecture. The current result shows the efficiency of multiple filter block to improve the performance of the original Hovernet model.  ( 2 min )
    Why adversarial training can hurt robust accuracy. (arXiv:2203.02006v1 [cs.LG])
    Machine learning classifiers with high test accuracy often perform poorly under adversarial attacks. It is commonly believed that adversarial training alleviates this issue. In this paper, we demonstrate that, surprisingly, the opposite may be true -- Even though adversarial training helps when enough data is available, it may hurt robust generalization in the small sample size regime. We first prove this phenomenon for a high-dimensional linear classification setting with noiseless observations. Our proof provides explanatory insights that may also transfer to feature learning models. Further, we observe in experiments on standard image datasets that the same behavior occurs for perceptible attacks that effectively reduce class information such as mask attacks and object corruptions.  ( 2 min )
    Enhanced physics-constrained deep neural networks for modeling vanadium redox flow battery. (arXiv:2203.01985v1 [physics.chem-ph])
    Numerical modeling and simulation have become indispensable tools for advancing a comprehensive understanding of the underlying mechanisms and cost-effective process optimization and control of flow batteries. In this study, we propose an enhanced version of the physics-constrained deep neural network (PCDNN) approach [1] to provide high-accuracy voltage predictions in the vanadium redox flow batteries (VRFBs). The purpose of the PCDNN approach is to enforce the physics-based zero-dimensional (0D) VRFB model in a neural network to assure model generalization for various battery operation conditions. Limited by the simplifications of the 0D model, the PCDNN cannot capture sharp voltage changes in the extreme SOC regions. To improve the accuracy of voltage prediction at extreme ranges, we introduce a second (enhanced) DNN to mitigate the prediction errors carried from the 0D model itself and call the resulting approach enhanced PCDNN (ePCDNN). By comparing the model prediction with experimental data, we demonstrate that the ePCDNN approach can accurately capture the voltage response throughout the charge--discharge cycle, including the tail region of the voltage discharge curve. Compared to the standard PCDNN, the prediction accuracy of the ePCDNN is significantly improved. The loss function for training the ePCDNN is designed to be flexible by adjusting the weights of the physics-constrained DNN and the enhanced DNN. This allows the ePCDNN framework to be transferable to battery systems with variable physical model fidelity.  ( 2 min )
    Abuse and Fraud Detection in Streaming Services Using Heuristic-Aware Machine Learning. (arXiv:2203.02124v1 [cs.LG])
    This work presents a fraud and abuse detection framework for streaming services by modeling user streaming behavior. The goal is to discover anomalous and suspicious incidents and scale the investigation efforts by creating models that characterize the user behavior. We study the use of semi-supervised as well as supervised approaches for anomaly detection. In the semi-supervised approach, by leveraging only a set of authenticated anomaly-free data samples, we show the use of one-class classification algorithms as well as autoencoder deep neural networks for anomaly detection. In the supervised anomaly detection task, we present a so-called heuristic-aware data labeling strategy for creating labeled data samples. We carry out binary classification as well as multi-class multi-label classification tasks for not only detecting the anomalous samples but also identifying the underlying anomaly behavior(s) associated with each one. Finally, using a systematic feature importance study we provide insights into the underlying set of features that characterize different streaming fraud categories. To the best of our knowledge, this is the first paper to use machine learning methods for fraud and abuse detection in real-world scale streaming services.  ( 2 min )
    User-Level Membership Inference Attack against Metric Embedding Learning. (arXiv:2203.02077v1 [cs.LG])
    Membership inference (MI) determines if a sample was part of a victim model training set. Recent development of MI attacks focus on record-level membership inference which limits their application in many real-world scenarios. For example, in the person re-identification task, the attacker (or investigator) is interested in determining if a user's images have been used during training or not. However, the exact training images might not be accessible to the attacker. In this paper, we develop a user-level MI attack where the goal is to find if any sample from the target user has been used during training even when no exact training sample is available to the attacker. We focus on metric embedding learning due to its dominance in person re-identification, where user-level MI attack is more sensible. We conduct an extensive evaluation on several datasets and show that our approach achieves high accuracy on user-level MI task.  ( 2 min )
    Second-order Symmetric Non-negative Latent Factor Analysis. (arXiv:2203.02088v1 [cs.LG])
    Precise representation of large-scale undirected network is the basis for understanding relations within a massive entity set. The undirected network representation task can be efficiently addressed by a symmetry non-negative latent factor (SNLF) model, whose objective is clearly non-convex. However, existing SNLF models commonly adopt a first-order optimizer that cannot well handle the non-convex objective, thereby resulting in inaccurate representation results. On the other hand, higher-order learning algorithms are expected to make a breakthrough, but their computation efficiency are greatly limited due to the direct manipulation of the Hessian matrix, which can be huge in undirected network representation tasks. Aiming at addressing this issue, this study proposes to incorporate an efficient second-order method into SNLF, thereby establishing a second-order symmetric non-negative latent factor analysis model for undirected network with two-fold ideas: a) incorporating a mapping strategy into SNLF model to form an unconstrained model, and b) training the unconstrained model with a specially designed second order method to acquire a proper second-order step efficiently. Empirical studies indicate that proposed model outperforms state-of-the-art models in representation accuracy with affordable computational burden.  ( 2 min )
    Provable and Efficient Continual Representation Learning. (arXiv:2203.02026v1 [cs.LG])
    In continual learning (CL), the goal is to design models that can learn a sequence of tasks without catastrophic forgetting. While there is a rich set of techniques for CL, relatively little understanding exists on how representations built by previous tasks benefit new tasks that are added to the network. To address this, we study the problem of continual representation learning (CRL) where we learn an evolving representation as new tasks arrive. Focusing on zero-forgetting methods where tasks are embedded in subnetworks (e.g., PackNet), we first provide experiments demonstrating CRL can significantly boost sample efficiency when learning new tasks. To explain this, we establish theoretical guarantees for CRL by providing sample complexity and generalization error bounds for new tasks by formalizing the statistical benefits of previously-learned representations. Our analysis and experiments also highlight the importance of the order in which we learn the tasks. Specifically, we show that CL benefits if the initial tasks have large sample size and high "representation diversity". Diversity ensures that adding new tasks incurs small representation mismatch and can be learned with few samples while training only few additional nonzero weights. Finally, we ask whether one can ensure each task subnetwork to be efficient during inference time while retaining the benefits of representation learning. To this end, we propose an inference-efficient variation of PackNet called Efficient Sparse PackNet (ESPN) which employs joint channel & weight pruning. ESPN embeds tasks in channel-sparse subnets requiring up to 80% less FLOPs to compute while approximately retaining accuracy and is very competitive with a variety of baselines. In summary, this work takes a step towards data and compute-efficient CL with a representation learning perspective. GitHub page: https://github.com/ucr-optml/CtRL  ( 2 min )
    Automatic Detection and Segmentation of Postoperative Cerebellar Damage Based on Normalization. (arXiv:2203.02042v1 [eess.IV])
    Surgical resection is a common procedure in the treatment of pediatric posterior fossa tumors. However, surgical damage is often unavoidable and its association with postoperative complications is not well understood. A reliable localization and measure of cerebellar damage is fundamental to study the relationship between the damaged cerebellar regions and postoperative neurological outcomes. Existing cerebellum normalization methods are not reliable on postoperative scans, therefore current approaches to measure surgical damage rely on manual labelling. In this work, we develop a robust algorithm to automatically detect and measure cerebellum damage due to surgery using postoperative 3D T1 magnetic resonance imaging. In our proposed approach, normal brain tissues are first segmented using a Bayesian algorithm customized for postoperative scans. Next, the cerebellum is isolated by nonlinear registration of a whole brain template to the native space. The isolated cerebellum is then normalized into the spatially unbiased atlas (SUIT) space using anatomical information derived from the previous step. Finally, the damage is detected in the atlas space by comparing the normalized cerebellum and the SUIT template. We evaluated our damage detection tool on postoperative scans of 153 patients diagnosed with medulloblastoma based on inspection by human expects. We also designed a simulation to test the proposed approach without human intervention. Our results show that the proposed approach has superior performance on various scenarios.  ( 2 min )
    Semantic-guided Image Virtual Attribute Learning for Noisy Multi-label Chest X-ray Classification. (arXiv:2203.01937v1 [eess.IV])
    Deep learning methods have shown outstanding classification accuracy in medical image analysis problems, which is largely attributed to the availability of large datasets manually annotated with clean labels. However, such manual annotation can be expensive to obtain for large datasets, so we may rely on machine-generated noisy labels. Many Chest X-ray (CXR) classifiers are modelled from datasets with machine-generated labels, but their training procedure is in general not robust to the presence of noisy-label samples and can overfit those samples to produce sub-optimal solutions. Furthermore, CXR datasets are mostly multi-label, so current noisy-label learning methods designed for multi-class problems cannot be easily adapted. To address such noisy multi-label CXR learning problem, we propose a new learning method based on estimating image virtual attributes using semantic information from the label to assist in the identification and correction of noisy multi-labels from training samples. Our experiments on diverse noisy multi-label training sets and clean testing sets show that our model has state-of-the-art accuracy and robustness across all datasets.  ( 2 min )
    On the relevance of language in speaker recognition. (arXiv:2203.01992v1 [eess.AS])
    This paper presents a new database collected from a bilingual speakers set (49), in two different languages: Spanish and Catalan. Phonetically there are significative differences between both languages. These differences have let us to establish several conclusions on the relevance of language in speaker recognition, using two methods: vector quantization and covariance matrices  ( 2 min )
    On consistency of constrained spectral clustering under representation-aware stochastic block model. (arXiv:2203.02005v1 [stat.ML])
    Spectral clustering is widely used in practice due to its flexibility, computational efficiency, and well-understood theoretical performance guarantees. Recently, spectral clustering has been studied to find balanced clusters under population-level constraints. These constraints are specified by additional information available in the form of auxiliary categorical node attributes. In this paper, we consider a scenario where these attributes may not be observable, but manifest as latent features of an auxiliary graph. Motivated by this, we study constrained spectral clustering with the aim of finding balanced clusters in a given \textit{similarity graph} $\mathcal{G}$, such that each individual is adequately represented with respect to an auxiliary graph $\mathcal{R}$ (we refer to this as representation graph). We propose an individual-level balancing constraint that formalizes this idea. Our work leads to an interesting stochastic block model that not only plants the given partitions in $\mathcal{G}$ but also plants the auxiliary information encoded in the representation graph $\mathcal{R}$. We develop unnormalized and normalized variants of spectral clustering in this setting. These algorithms use $\mathcal{R}$ to find clusters in $\mathcal{G}$ that approximately satisfy the proposed constraint. We also establish the first statistical consistency result for constrained spectral clustering under individual-level constraints for graphs sampled from the above-mentioned variant of the stochastic block model. Our experimental results corroborate our theoretical findings.  ( 2 min )
    Interventions, Where and How? Experimental Design for Causal Models at Scale. (arXiv:2203.02016v1 [cs.LG])
    Causal discovery from observational and interventional data is challenging due to limited data and non-identifiability which introduces uncertainties in estimating the underlying structural causal model (SCM). Incorporating these uncertainties and selecting optimal experiments (interventions) to perform can help to identify the true SCM faster. Existing methods in experimental design for causal discovery from limited data either rely on linear assumptions for the SCM or select only the intervention target. In this paper, we incorporate recent advances in Bayesian causal discovery into the Bayesian optimal experimental design framework, which allows for active causal discovery of nonlinear, large SCMs, while selecting both the target and the value to intervene with. We demonstrate the performance of the proposed method on synthetic graphs (Erdos-R\`enyi, Scale Free) for both linear and nonlinear SCMs as well as on the in-silico single-cell gene regulatory network dataset, DREAM.  ( 2 min )
    Comparison of LSTM autoencoder based deep learning enabled Bayesian inference using two time series reconstruction approaches. (arXiv:2203.01936v1 [cs.LG])
    In this work, we use a combination of Bayesian inference, Markov chain Monte Carlo and deep learning in the form of LSTM autoencoders to build and test a framework to provide robust estimates of injection rate from ground surface data in coupled flow and geomechanics problems. We use LSTM autoencoders to reconstruct the displacement time series for grid points on the top surface of a faulting due to water injection problem. We then deploy this LSTM autoencoder based model instead of the high fidelity model in the Bayesian inference framework to estimate injection rate from displacement input.  ( 2 min )
    Autoregressive Image Generation using Residual Quantization. (arXiv:2203.01941v1 [cs.CV])
    For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence length is important for an AR model to reduce its computational costs to consider long-range interactions of codes. However, we postulate that previous VQ cannot shorten the code sequence and generate high-fidelity images together in terms of the rate-distortion trade-off. In this study, we propose the two-stage framework, which consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer, to effectively generate high-resolution images. Given a fixed codebook size, RQ-VAE can precisely approximate a feature map of an image and represent the image as a stacked map of discrete codes. Then, RQ-Transformer learns to predict the quantized feature vector at the next position by predicting the next stack of codes. Thanks to the precise approximation of RQ-VAE, we can represent a 256$\times$256 image as 8$\times$8 resolution of the feature map, and RQ-Transformer can efficiently reduce the computational costs. Consequently, our framework outperforms the existing AR models on various benchmarks of unconditional and conditional image generation. Our approach also has a significantly faster sampling speed than previous AR models to generate high-quality images.  ( 2 min )
  • Open

    Dynamic Backdoor Attacks Against Machine Learning Models. (arXiv:2003.03675v2 [cs.CR] UPDATED)
    Machine learning (ML) has made tremendous progress during the past decade and is being adopted in various critical real-world applications. However, recent research has shown that ML models are vulnerable to multiple security and privacy attacks. In particular, backdoor attacks against ML models have recently raised a lot of awareness. A successful backdoor attack can cause severe consequences, such as allowing an adversary to bypass critical authentication systems. Current backdooring techniques rely on adding static triggers (with fixed patterns and locations) on ML model inputs which are prone to detection by the current backdoor detection mechanisms. In this paper, we propose the first class of dynamic backdooring techniques against deep neural networks (DNN), namely Random Backdoor, Backdoor Generating Network (BaN), and conditional Backdoor Generating Network (c-BaN). Triggers generated by our techniques can have random patterns and locations, which reduce the efficacy of the current backdoor detection mechanisms. In particular, BaN and c-BaN based on a novel generative network are the first two schemes that algorithmically generate triggers. Moreover, c-BaN is the first conditional backdooring technique that given a target label, it can generate a target-specific trigger. Both BaN and c-BaN are essentially a general framework which renders the adversary the flexibility for further customizing backdoor attacks. We extensively evaluate our techniques on three benchmark datasets: MNIST, CelebA, and CIFAR-10. Our techniques achieve almost perfect attack performance on backdoored data with a negligible utility loss. We further show that our techniques can bypass current state-of-the-art defense mechanisms against backdoor attacks, including ABS, Februus, MNTD, Neural Cleanse, and STRIP.  ( 2 min )
    Do Vision Transformers See Like Convolutional Neural Networks?. (arXiv:2108.08810v2 [cs.CV] UPDATED)
    Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers. We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer.  ( 2 min )
    R-GCN: The R Could Stand for Random. (arXiv:2203.02424v1 [cs.LG])
    The inception of Relational Graph Convolutional Networks (R-GCNs) marked a milestone in the Semantic Web domain as it allows for end-to-end training of machine learning models that operate on Knowledge Graphs (KGs). R-GCNs generate a representation for a node of interest by repeatedly aggregating parametrised, relation-specific transformations of its neighbours. However, in this paper, we argue that the the R-GCN's main contribution lies in this "message passing" paradigm, rather than the learned parameters. To this end, we introduce the "Random Relational Graph Convolutional Network" (RR-GCN), which constructs embeddings for nodes in the KG by aggregating randomly transformed random information from neigbours, i.e., with no learned parameters. We empirically show that RR-GCNs can compete with fully trained R-GCNs in both node classification and link prediction settings. The implications of these results are two-fold: on the one hand, our technique can be used as a quick baseline that novel KG embedding methods should be able to beat. On the other hand, it demonstrates that further research might reveal more parameter-efficient inductive biases for KGs.  ( 2 min )
    Why adversarial training can hurt robust accuracy. (arXiv:2203.02006v1 [cs.LG])
    Machine learning classifiers with high test accuracy often perform poorly under adversarial attacks. It is commonly believed that adversarial training alleviates this issue. In this paper, we demonstrate that, surprisingly, the opposite may be true -- Even though adversarial training helps when enough data is available, it may hurt robust generalization in the small sample size regime. We first prove this phenomenon for a high-dimensional linear classification setting with noiseless observations. Our proof provides explanatory insights that may also transfer to feature learning models. Further, we observe in experiments on standard image datasets that the same behavior occurs for perceptible attacks that effectively reduce class information such as mask attacks and object corruptions.  ( 2 min )
    Markov subsampling based Huber Criterion. (arXiv:2112.06134v2 [stat.ML] UPDATED)
    Subsampling is an important technique to tackle the computational challenges brought by big data. Many subsampling procedures fall within the framework of importance sampling, which assigns high sampling probabilities to the samples appearing to have big impacts. When the noise level is high, those sampling procedures tend to pick many outliers and thus often do not perform satisfactorily in practice. To tackle this issue, we design a new Markov subsampling strategy based on Huber criterion (HMS) to construct an informative subset from the noisy full data; the constructed subset then serves as a refined working data for efficient processing. HMS is built upon a Metropolis-Hasting procedure, where the inclusion probability of each sampling unit is determined using the Huber criterion to prevent over scoring the outliers. Under mild conditions, we show that the estimator based on the subsamples selected by HMS is statistically consistent with a sub-Gaussian deviation bound. The promising performance of HMS is demonstrated by extensive studies on large scale simulations and real data examples.  ( 2 min )
    Adaptive Semi-Supervised Inference for Optimal Treatment Decisions with Electronic Medical Record Data. (arXiv:2203.02318v1 [stat.ME])
    A treatment regime is a rule that assigns a treatment to patients based on their covariate information. Recently, estimation of the optimal treatment regime that yields the greatest overall expected clinical outcome of interest has attracted a lot of attention. In this work, we consider estimation of the optimal treatment regime with electronic medical record data under a semi-supervised setting. Here, data consist of two parts: a set of `labeled' patients for whom we have the covariate, treatment and outcome information, and a much larger set of `unlabeled' patients for whom we only have the covariate information. We proposes an imputation-based semi-supervised method, utilizing `unlabeled' individuals to obtain a more efficient estimator of the optimal treatment regime. The asymptotic properties of the proposed estimators and their associated inference procedure are provided. Simulation studies are conducted to assess the empirical performance of the proposed method and to compare with a fully supervised method using only the labeled data. An application to an electronic medical record data set on the treatment of hypotensive episodes during intensive care unit (ICU) stays is also given for further illustration.  ( 2 min )
    Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features. (arXiv:2104.00629v2 [stat.ML] UPDATED)
    Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect in data analysis. A common problem are high cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance, and -- if possible -- derive best practices on when to use which technique. We conducted a large-scale benchmark experiment, where we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary- and multiclass- classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or to reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.  ( 2 min )
    The perils of being unhinged: On the accuracy of classifiers minimizing a noise-robust convex loss. (arXiv:2112.04590v2 [cs.LG] UPDATED)
    Van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the "unhinged" loss function is robust in this sense. In this note we study the accuracy of binary classifiers obtained by minimizing the unhinged loss, and observe that even for simple linearly separable data distributions, minimizing the unhinged loss may only yield a binary classifier with accuracy no better than random guessing.  ( 2 min )
    Solving Inverse Problems by Joint Posterior Maximization with Autoencoding Prior. (arXiv:2103.01648v3 [stat.ML] UPDATED)
    In this work we address the problem of solving ill-posed inverse problems in imaging where the prior is a variational autoencoder (VAE). Specifically we consider the decoupled case where the prior is trained once and can be reused for many different log-concave degradation models without retraining. Whereas previous MAP-based approaches to this problem lead to highly non-convex optimization algorithms, our approach computes the joint (space-latent) MAP that naturally leads to alternate optimization algorithms and to the use of a stochastic encoder to accelerate computations. The resulting technique (JPMAP) performs Joint Posterior Maximization using an Autoencoding Prior. We show theoretical and experimental evidence that the proposed objective function is quite close to bi-convex. Indeed it satisfies a weak bi-convexity property which is sufficient to guarantee that our optimization scheme converges to a stationary point. We also highlight the importance of correctly training the VAE using a denoising criterion, in order to ensure that the encoder generalizes well to out-of-distribution images, without affecting the quality of the generative model. This simple modification is key to providing robustness to the whole procedure. Finally we show how our joint MAP methodology relates to more common MAP approaches, and we propose a continuation scheme that makes use of our JPMAP algorithm to provide more robust MAP estimates. Experimental results also show the higher quality of the solutions obtained by our JPMAP approach with respect to other non-convex MAP approaches which more often get stuck in spurious local optima.  ( 2 min )
    Interpretable Off-Policy Learning via Hyperbox Search. (arXiv:2203.02473v1 [stat.ML])
    Personalized treatment decisions have become an integral part of modern medicine. Thereby, the aim is to make treatment decisions based on individual patient characteristics. Numerous methods have been developed for learning such policies from observational data that achieve the best outcome across a certain policy class. Yet these methods are rarely interpretable. However, interpretability is often a prerequisite for policy learning in clinical practice. In this paper, we propose an algorithm for interpretable off-policy learning via hyperbox search. In particular, our policies can be represented in disjunctive normal form (i.e., OR-of-ANDs) and are thus intelligible. We prove a universal approximation theorem that shows that our policy class is flexible enough to approximate any measurable function arbitrarily well. For optimization, we develop a tailored column generation procedure within a branch-and-bound framework. Using a simulation study, we demonstrate that our algorithm outperforms state-of-the-art methods from interpretable off-policy learning in terms of regret. Using real-word clinical data, we perform a user study with actual clinical experts, who rate our policies as highly interpretable.  ( 2 min )
    On consistency of constrained spectral clustering under representation-aware stochastic block model. (arXiv:2203.02005v1 [stat.ML])
    Spectral clustering is widely used in practice due to its flexibility, computational efficiency, and well-understood theoretical performance guarantees. Recently, spectral clustering has been studied to find balanced clusters under population-level constraints. These constraints are specified by additional information available in the form of auxiliary categorical node attributes. In this paper, we consider a scenario where these attributes may not be observable, but manifest as latent features of an auxiliary graph. Motivated by this, we study constrained spectral clustering with the aim of finding balanced clusters in a given \textit{similarity graph} $\mathcal{G}$, such that each individual is adequately represented with respect to an auxiliary graph $\mathcal{R}$ (we refer to this as representation graph). We propose an individual-level balancing constraint that formalizes this idea. Our work leads to an interesting stochastic block model that not only plants the given partitions in $\mathcal{G}$ but also plants the auxiliary information encoded in the representation graph $\mathcal{R}$. We develop unnormalized and normalized variants of spectral clustering in this setting. These algorithms use $\mathcal{R}$ to find clusters in $\mathcal{G}$ that approximately satisfy the proposed constraint. We also establish the first statistical consistency result for constrained spectral clustering under individual-level constraints for graphs sampled from the above-mentioned variant of the stochastic block model. Our experimental results corroborate our theoretical findings.  ( 2 min )
    Is Importance Weighting Incompatible with Interpolating Classifiers?. (arXiv:2112.12986v2 [cs.LG] UPDATED)
    Importance weighting is a classic technique to handle distribution shifts. However, prior work has presented strong empirical and theoretical evidence demonstrating that importance weights can have little to no effect on overparameterized neural networks. Is importance weighting truly incompatible with the training of overparameterized neural networks? Our paper answers this in the negative. We show that importance weighting fails not because of the overparameterization, but instead, as a result of using exponentially-tailed losses like the logistic or cross-entropy loss. As a remedy, we show that polynomially-tailed losses restore the effects of importance reweighting in correcting distribution shift in overparameterized models. We characterize the behavior of gradient descent on importance weighted polynomially-tailed losses with overparameterized linear models, and theoretically demonstrate the advantage of using polynomially-tailed losses in a label shift setting. Surprisingly, our theory shows that using weights that are obtained by exponentiating the classical unbiased importance weights can improve performance. Finally, we demonstrate the practical value of our analysis with neural network experiments on a subpopulation shift and a label shift dataset. When reweighted, our loss function can outperform reweighted cross-entropy by as much as 9% in test accuracy. Our loss function also gives test accuracies comparable to, or even exceeding, well-tuned state-of-the-art methods for correcting distribution shifts.  ( 2 min )
    The Familiarity Hypothesis: Explaining the Behavior of Deep Open Set Methods. (arXiv:2203.02486v1 [cs.CV])
    In many object recognition applications, the set of possible categories is an open set, and the deployed recognition system will encounter novel objects belonging to categories unseen during training. Detecting such "novel category" objects is usually formulated as an anomaly detection problem. Anomaly detection algorithms for feature-vector data identify anomalies as outliers, but outlier detection has not worked well in deep learning. Instead, methods based on the computed logits of visual object classifiers give state-of-the-art performance. This paper proposes the Familiarity Hypothesis that these methods succeed because they are detecting the absence of familiar learned features rather than the presence of novelty. The paper reviews evidence from the literature and presents additional evidence from our own experiments that provide strong support for this hypothesis. The paper concludes with a discussion of whether familiarity detection is an inevitable consequence of representation learning.  ( 2 min )
    AutoDIME: Automatic Design of Interesting Multi-Agent Environments. (arXiv:2203.02481v1 [cs.LG])
    Designing a distribution of environments in which RL agents can learn interesting and useful skills is a challenging and poorly understood task, for multi-agent environments the difficulties are only exacerbated. One approach is to train a second RL agent, called a teacher, who samples environments that are conducive for the learning of student agents. However, most previous proposals for teacher rewards do not generalize straightforwardly to the multi-agent setting. We examine a set of intrinsic teacher rewards derived from prediction problems that can be applied in multi-agent settings and evaluate them in Mujoco tasks such as multi-agent Hide and Seek as well as a diagnostic single-agent maze task. Of the intrinsic rewards considered we found value disagreement to be most consistent across tasks, leading to faster and more reliable emergence of advanced skills in Hide and Seek and the maze task. Another candidate intrinsic reward considered, value prediction error, also worked well in Hide and Seek but was susceptible to noisy-TV style distractions in stochastic environments. Policy disagreement performed well in the maze task but did not speed up learning in Hide and Seek. Our results suggest that intrinsic teacher rewards, and in particular value disagreement, are a promising approach for automating both single and multi-agent environment design.  ( 2 min )
    Interventions, Where and How? Experimental Design for Causal Models at Scale. (arXiv:2203.02016v1 [cs.LG])
    Causal discovery from observational and interventional data is challenging due to limited data and non-identifiability which introduces uncertainties in estimating the underlying structural causal model (SCM). Incorporating these uncertainties and selecting optimal experiments (interventions) to perform can help to identify the true SCM faster. Existing methods in experimental design for causal discovery from limited data either rely on linear assumptions for the SCM or select only the intervention target. In this paper, we incorporate recent advances in Bayesian causal discovery into the Bayesian optimal experimental design framework, which allows for active causal discovery of nonlinear, large SCMs, while selecting both the target and the value to intervene with. We demonstrate the performance of the proposed method on synthetic graphs (Erdos-R\`enyi, Scale Free) for both linear and nonlinear SCMs as well as on the in-silico single-cell gene regulatory network dataset, DREAM.  ( 2 min )
    Natural Reweighted Wake-Sleep. (arXiv:2008.06687v2 [cs.LG] UPDATED)
    Helmholtz Machines (HMs) are a class of generative models composed of two Sigmoid Belief Networks (SBNs), acting as an encoder and a decoder. These models are commonly trained using a two-step optimization algorithm called Wake-Sleep (WS) and more recently by improved versions, such as the Reweighted Wake-Sleep (RWS) and Bidirectional Helmholtz Machines (BiHM). The locality of the connections in a SBN induces sparsity in the Fisher information matrix associated to the model, in the form of a finely-grained block-diagonal structure. In this paper we exploit this property to efficiently train SBNs and HMs using the natural gradient. We present a novel algorithm called Natural Reweighted Wake-Sleep (NRWS), which corresponds to a geometric adaptation of the Reweighted Wake-Sleep, where, differently from most of the previous work, the natural gradient is computed without the need of introducing any approximation of the structure of the Fisher Information Matrix. The experiments performed on standard datasets from the literature show a consistent improvement of NRWS not only with respect to its non-geometric baseline but also with respect to state-of-the-art training algorithms for HMs. The improvement is quantified both in terms of speed of convergence as well as value of the log-likelihood reached after training.  ( 2 min )
    Graph clustering with Boltzmann machines. (arXiv:2203.02471v1 [cs.LG])
    Graph clustering is the process of grouping vertices into densely connected sets called clusters. We tailor two mathematical programming formulations from the literature, to this problem. In doing so, we obtain a heuristic approximation to the intra-cluster density maximization problem. We use two variations of a Boltzmann machine heuristic to obtain numerical solutions. For benchmarking purposes, we compare solution quality and computational performances to those obtained using a commercial solver, Gurobi. We also compare clustering quality to the clusters obtained using the popular Louvain modularity maximization method. Our initial results clearly demonstrate the superiority of our problem formulations. They also establish the superiority of the Boltzmann machine over the traditional exact solver. In the case of smaller less complex graphs, Boltzmann machines provide the same solutions as Gurobi, but with solution times that are orders of magnitude lower. In the case of larger and more complex graphs, Gurobi fails to return meaningful results within a reasonable time frame. Finally, we also note that both our clustering formulations, the distance minimization and $K$-medoids, yield clusters of superior quality to those obtained with the Louvain algorithm.  ( 2 min )
    Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in "large $n$ & large $p$" Bayesian sparse regression. (arXiv:1810.12437v6 [stat.CO] UPDATED)
    In a modern observational study based on healthcare databases, the number of observations and of predictors typically range in the order of $10^5$ ~ $10^6$ and of $10^4$ ~ $10^5$. Despite the large sample size, data rarely provide sufficient information to reliably estimate such a large number of parameters. Sparse regression techniques provide potential solutions, one notable approach being the Bayesian methods based on shrinkage priors. In the "large n & large p" setting, however, posterior computation encounters a major bottleneck at repeated sampling from a high-dimensional Gaussian distribution, whose precision matrix $\Phi$ is expensive to compute and factorize. In this article, we present a novel algorithm to speed up this bottleneck based on the following observation: we can cheaply generate a random vector $b$ such that the solution to the linear system $\Phi \beta = b$ has the desired Gaussian distribution. We can then solve the linear system by the conjugate gradient (CG) algorithm through matrix-vector multiplications by $\Phi$; this involves no explicit factorization or calculation of $\Phi$ itself. Rapid convergence of CG in this context is guaranteed by the theory of prior-preconditioning we develop. We apply our algorithm to a clinically relevant large-scale observational study with n = 72,489 patients and p = 22,175 clinical covariates, designed to assess the relative risk of adverse events from two alternative blood anti-coagulants. Our algorithm demonstrates an order of magnitude speed-up in posterior inference, in our case cutting the computation time from two weeks to less than a day.  ( 3 min )
    Reconstructing nonlinear dynamical systems from multi-modal time series. (arXiv:2111.02922v2 [cs.LG] UPDATED)
    Empirically observed time series in physics, biology, or medicine, are commonly generated by some underlying dynamical system (DS) which is the target of scientific interest. There is an increasing interest to harvest machine learning methods to reconstruct this latent DS in a data-driven, unsupervised way. In many areas of science it is common to sample time series observations from many data modalities simultaneously, e.g. electrophysiological and behavioral time series in a typical neuroscience experiment. However, current machine learning tools for reconstructing DSs usually focus on just one data modality. Here we propose a general framework for multi-modal data integration for the purpose of nonlinear DS reconstruction and the analysis of cross-modal relations. This framework is based on dynamically interpretable recurrent neural networks as general approximators of nonlinear DSs, coupled to sets of modality-specific decoder models from the class of generalized linear models. Both an expectation-maximization and a variational inference algorithm for model training are advanced and compared. We show on nonlinear DS benchmarks that our algorithms can efficiently compensate for too noisy or missing information in one data channel by exploiting other channels, and demonstrate on experimental neuroscience data how the algorithm learns to link different data domains to the underlying dynamics.  ( 2 min )
    Semantic-Aware Collaborative Deep Reinforcement Learning Over Wireless Cellular Networks. (arXiv:2111.12064v2 [cs.IT] UPDATED)
    Collaborative deep reinforcement learning (CDRL) algorithms in which multiple agents can coordinate over a wireless network is a promising approach to enable future intelligent and autonomous systems that rely on real-time decision-making in complex dynamic environments. Nonetheless, in practical scenarios, CDRL faces many challenges due to the heterogeneity of agents and their learning tasks, different environments, time constraints of the learning, and resource limitations of wireless networks. To address these challenges, in this paper, a novel semantic-aware CDRL method is proposed to enable a group of heterogeneous untrained agents with semantically-linked DRL tasks to collaborate efficiently across a resource-constrained wireless cellular network. To this end, a new heterogeneous federated DRL (HFDRL) algorithm is proposed to select the best subset of semantically relevant DRL agents for collaboration. The proposed approach then jointly optimizes the training loss and wireless bandwidth allocation for the cooperating selected agents in order to train each agent within the time limit of its real-time task. Simulation results show the superior performance of the proposed algorithm compared to state-of-the-art baselines.  ( 2 min )
    On the Existence of the Adversarial Bayes Classifier (Extended Version). (arXiv:2112.01694v2 [cs.LG] UPDATED)
    Adversarial robustness is a critical property in a variety of modern machine learning applications. While it has been the subject of several recent theoretical studies, many important questions related to adversarial robustness are still open. In this work, we study a fundamental question regarding Bayes optimality for adversarial robustness. We provide general sufficient conditions under which the existence of a Bayes optimal classifier can be guaranteed for adversarial robustness. Our results can provide a useful tool for a subsequent study of surrogate losses in adversarial robustness and their consistency properties. This manuscript is the extended version of the paper "On the Existence of the Adversarial Bayes Classifier" published in NeurIPS. The results of the original paper did not apply to some non-strictly convex norms. Here we extend our results to all possible norms. Additionally, we clarify a missing step in one of our proofs.  ( 2 min )
    Salvaging Federated Learning by Local Adaptation. (arXiv:2002.04758v3 [cs.LG] UPDATED)
    Federated learning (FL) is a heavily promoted approach for training ML models on sensitive data, e.g., text typed by users on their smartphones. FL is expressly designed for training on data that are unbalanced and non-iid across the participants. To ensure privacy and integrity of the fedeated model, latest FL approaches use differential privacy or robust aggregation. We look at FL from the \emph{local} viewpoint of an individual participant and ask: (1) do participants have an incentive to participate in FL? (2) how can participants \emph{individually} improve the quality of their local models, without re-designing the FL framework and/or involving other participants? First, we show that on standard tasks such as next-word prediction, many participants gain no benefit from FL because the federated model is less accurate on their data than the models they can train locally on their own. Second, we show that differential privacy and robust aggregation make this problem worse by further destroying the accuracy of the federated model for many participants. Then, we evaluate three techniques for local adaptation of federated models: fine-tuning, multi-task learning, and knowledge distillation. We analyze where each is applicable and demonstrate that all participants benefit from local adaptation. Participants whose local models are poor obtain big accuracy improvements over conventional FL. Participants whose local models are better than the federated model\textemdash and who have no incentive to participate in FL today\textemdash improve less, but sufficiently to make the adapted federated model better than their local models.  ( 2 min )
    Interpolating between sampling and variational inference with infinite stochastic mixtures. (arXiv:2110.09618v2 [stat.ML] UPDATED)
    Sampling and Variational Inference (VI) are two large families of methods for approximate inference that have complementary strengths. Sampling methods excel at approximating arbitrary probability distributions, but can be inefficient. VI methods are efficient, but may misrepresent the true distribution. Here, we develop a general framework where approximations are stochastic mixtures of simple component distributions. Both sampling and VI can be seen as special cases: in sampling, each mixture component is a delta-function and is chosen stochastically, while in standard VI a single component is chosen to minimize divergence. We derive a practical method that interpolates between sampling and VI by solving an optimization problem over a mixing distribution. Intermediate inference methods then arise by varying a single parameter. Our method provably improves on sampling (reducing variance) and on VI (reducing bias+variance despite increasing variance). We demonstrate our method's bias/variance trade-off in practice on reference problems, and we compare outcomes to commonly used sampling and VI methods. This work takes a step towards a highly flexible yet simple family of inference methods that combines the complementary strengths of sampling and VI.  ( 2 min )
    Elements of Sequential Monte Carlo. (arXiv:1903.04797v2 [stat.ML] UPDATED)
    A core problem in statistics and probabilistic machine learning is to compute probability distributions and expectations. This is the fundamental problem of Bayesian statistics and machine learning, which frames all inference as expectations with respect to the posterior distribution. The key challenge is to approximate these intractable expectations. In this tutorial, we review sequential Monte Carlo (SMC), a random-sampling-based class of methods for approximate inference. First, we explain the basics of SMC, discuss practical issues, and review theoretical results. We then examine two of the main user design choices: the proposal distributions and the so called intermediate target distributions. We review recent results on how variational inference and amortization can be used to learn efficient proposals and target distributions. Next, we discuss the SMC estimate of the normalizing constant, how this can be used for pseudo-marginal inference and inference evaluation. Throughout the tutorial we illustrate the use of SMC on various models commonly used in machine learning, such as stochastic recurrent neural networks, probabilistic graphical models, and probabilistic programs.  ( 2 min )
    iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform. (arXiv:2203.02395v1 [cs.SD])
    In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-spectrogram vocoder solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly calculating a raw waveform. Such an approach allows skipping redundant processes during waveform synthesis (e.g., the direct reconstruction of high-dimensional original-scale spectrograms). By contrast, the approach solves all problems in a black box and cannot effectively employ the time-frequency structures existing in a mel-spectrogram. We thus propose iSTFTNet, which replaces some output-side layers of the mel-spectrogram vocoder with the inverse short-time Fourier transform (iSTFT) after sufficiently reducing the frequency dimension using upsampling layers, reducing the computational cost from black-box modeling and avoiding redundant estimations of high-dimensional spectrograms. During our experiments, we applied our ideas to three HiFi-GAN variants and made the models faster and more lightweight with a reasonable speech quality. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet/.  ( 2 min )
    Evaluating Local Model-Agnostic Explanations of Learning to Rank Models with Decision Paths. (arXiv:2203.02295v1 [stat.ML])
    Local explanations of learning-to-rank (LTR) models are thought to extract the most important features that contribute to the ranking predicted by the LTR model for a single data point. Evaluating the accuracy of such explanations is challenging since the ground truth feature importance scores are not available for most modern LTR models. In this work, we propose a systematic evaluation technique for explanations of LTR models. Instead of using black-box models, such as neural networks, we propose to focus on tree-based LTR models, from which we can extract the ground truth feature importance scores using decision paths. Once extracted, we can directly compare the ground truth feature importance scores to the feature importance scores generated with explanation techniques. We compare two recently proposed explanation techniques for LTR models and benchmark them using decision trees and gradient boosting models on the MQ2008 dataset. We show that neither of the explanation techniques can achieve an acceptable explanation accuracy when the chosen similarity metric is AUC score or Spearman's rank correlation.  ( 2 min )
    Interpretable Latent Variables in Deep State Space Models. (arXiv:2203.02057v1 [stat.ML])
    We introduce a new version of deep state-space models (DSSMs) that combines a recurrent neural network with a state-space framework to forecast time series data. The model estimates the observed series as functions of latent variables that evolve non-linearly through time. Due to the complexity and non-linearity inherent in DSSMs, previous works on DSSMs typically produced latent variables that are very difficult to interpret. Our paper focus on producing interpretable latent parameters with two key modifications. First, we simplify the predictive decoder by restricting the response variables to be a linear transformation of the latent variables plus some noise. Second, we utilize shrinkage priors on the latent variables to reduce redundancy and improve robustness. These changes make the latent variables much easier to understand and allow us to interpret the resulting latent variables as random effects in a linear mixed model. We show through two public benchmark datasets the resulting model improves forecasting performances.  ( 2 min )
    Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms. (arXiv:2203.02474v1 [stat.ML])
    Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the `compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the `lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a `lossy compression' scheme allows us to link generalization to the rate-distortion dimension -- a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.  ( 2 min )
    Multi-objective robust optimization using adaptive surrogate models for problems with mixed continuous-categorical parameters. (arXiv:2203.01996v1 [stat.ME])
    Explicitly accounting for uncertainties is paramount to the safety of engineering structures. Optimization which is often carried out at the early stage of the structural design offers an ideal framework for this task. When the uncertainties are mainly affecting the objective function, robust design optimization is traditionally considered. This work further assumes the existence of multiple and competing objective functions that need to be dealt with simultaneously. The optimization problem is formulated by considering quantiles of the objective functions which allows for the combination of both optimality and robustness in a single metric. By introducing the concept of common random numbers, the resulting nested optimization problem may be solved using a general-purpose solver, herein the non-dominated sorting genetic algorithm (NSGA-II). The computational cost of such an approach is however a serious hurdle to its application in real-world problems. We therefore propose a surrogate-assisted approach using Kriging as an inexpensive approximation of the associated computational model. The proposed approach consists of sequentially carrying out NSGA-II while using an adaptively built Kriging model to estimate of the quantiles. Finally, the methodology is adapted to account for mixed categorical-continuous parameters as the applications involve the selection of qualitative design parameters as well. The methodology is first applied to two analytical examples showing its efficiency. The third application relates to the selection of optimal renovation scenarios of a building considering both its life cycle cost and environmental impact. It shows that when it comes to renovation, the heating system replacement should be the priority.  ( 2 min )
    Better Supervisory Signals by Observing Learning Paths. (arXiv:2203.02485v1 [stat.ML])
    Better-supervised models might have better performance. In this paper, we first clarify what makes for good supervision for a classification problem, and then explain two existing label refining methods, label smoothing and knowledge distillation, in terms of our proposed criterion. To further answer why and how better supervision emerges, we observe the learning path, i.e., the trajectory of the model's predictions during training, for each training sample. We find that the model can spontaneously refine "bad" labels through a "zig-zag" learning path, which occurs on both toy and real datasets. Observing the learning path not only provides a new perspective for understanding knowledge distillation, overfitting, and learning dynamics, but also reveals that the supervisory signal of a teacher network can be very unstable near the best points in training on real tasks. Inspired by this, we propose a new knowledge distillation scheme, Filter-KD, which improves downstream classification performance in various settings.  ( 2 min )
    Unsupervised detection and open-set classification of fast-ramped flexibility activation events. (arXiv:2111.02174v3 [eess.SY] UPDATED)
    The continuous electrification of the mobility and heating sectors adds much-needed flexibility to the power system. However, flexibility utilization also introduces new challenges to distribution system operators (DSOs), who need mechanisms to supervise flexibility activations and monitor their effect on distribution network operation. Flexibility activations can be broadly categorized to those originating from electricity markets and those initiated by the DSO to avoid constraint violations. Simultaneous electricity market driven flexibility activations may cause voltage quality or temporary overloading issues, and the failure of flexibility activations initiated by the DSO might leave critical grid states unresolved. This work proposes a novel data processing pipeline for automated real-time identification of fast-ramped flexibility activation events. Its practical value is twofold: i) potentially critical flexibility activations originating from electricity markets can be detected by the DSO at an early stage, and ii) successful activation of DSO-requested flexibility can be verified by the operator. In both cases the increased awareness would allow the DSO to take counteractions to avoid potentially critical grid situations. The proposed pipeline combines techniques from unsupervised detection and open-set classification. For both building blocks feasibility is systematically evaluated and proofed on real load and flexibility activation data.  ( 2 min )
    The Machine Learning for Combinatorial Optimization Competition (ML4CO): Results and Insights. (arXiv:2203.02433v1 [cs.LG])
    Combinatorial optimization is a well-established area in operations research and computer science. Until recently, its methods have focused on solving problem instances in isolation, ignoring that they often stem from related data distributions in practice. However, recent years have seen a surge of interest in using machine learning as a new approach for solving combinatorial problems, either directly as solvers or by enhancing exact solvers. Based on this context, the ML4CO aims at improving state-of-the-art combinatorial optimization solvers by replacing key heuristic components. The competition featured three challenging tasks: finding the best feasible solution, producing the tightest optimality certificate, and giving an appropriate solver configuration. Three realistic datasets were considered: balanced item placement, workload apportionment, and maritime inventory routing. This last dataset was kept anonymous for the contestants.  ( 2 min )
    Symmetry Structured Convolutional Neural Networks. (arXiv:2203.02056v1 [stat.ML])
    We consider Convolutional Neural Networks (CNNs) with 2D structured features that are symmetric in the spatial dimensions. Such networks arise in modeling pairwise relationships for a sequential recommendation problem, as well as secondary structure inference problems of RNA and protein sequences. We develop a CNN architecture that generates and preserves the symmetry structure in the network's convolutional layers. We present parameterizations for the convolutional kernels that produce update rules to maintain symmetry throughout the training. We apply this architecture to the sequential recommendation problem, the RNA secondary structure inference problem, and the protein contact map prediction problem, showing that the symmetric structured networks produce improved results using fewer numbers of machine parameters.  ( 2 min )
    Optimal Algorithms for Differentially Private Stochastic Monotone Variational Inequalities and Saddle-Point Problems. (arXiv:2104.02988v2 [math.OC] UPDATED)
    In this work, we conduct the first systematic study of stochastic variational inequality (SVI) and stochastic saddle point (SSP) problems under the constraint of differential privacy-(DP). We propose two algorithms: Noisy Stochastic Extragradient (NSEG) and Noisy Inexact Stochastic Proximal Point (NISPP). We show that sampling with replacement variants of these algorithms attain the optimal risk for DP-SVI and DP-SSP. Key to our analysis is the investigation of algorithmic stability bounds, both of which are new even in the nonprivate case, together with a novel "stability implies generalization" result for the gap functions for SVI and SSP problems. The dependence of the running time of these algorithms, with respect to the dataset size $n$, is $n^2$ for NSEG and $\widetilde{O}(n^{3/2})$ for NISPP.  ( 2 min )
    Star Temporal Classification: Sequence Classification with Partially Labeled Data. (arXiv:2201.12208v2 [cs.LG] UPDATED)
    We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC) which uses a special star token to allow alignments which include all possible tokens whenever a token could be missing. We express STC as the composition of weighted finite-state transducers (WFSTs) and use GTN (a framework for automatic differentiation with WFSTs) to compute gradients. We perform extensive experiments on automatic speech recognition. These experiments show that STC can recover most of the performance of supervised baseline when up to 70% of the labels are missing. We also perform experiments in handwriting recognition to show that our method easily applies to other sequence classification tasks.  ( 2 min )
    Sharper Bounds for Proximal Gradient Algorithms with Errors. (arXiv:2203.02204v1 [math.OC])
    We analyse the convergence of the proximal gradient algorithm for convex composite problems in the presence of gradient and proximal computational inaccuracies. We derive new tighter deterministic and probabilistic bounds that we use to verify a simulated (MPC) and a synthetic (LASSO) optimization problems solved on a reduced-precision machine in combination with an inaccurate proximal operator. We also show how the probabilistic bounds are more robust for algorithm verification and more accurate for application performance guarantees. Under some statistical assumptions, we also prove that some cumulative error terms follow a martingale property. And conforming to observations, e.g., in \cite{schmidt2011convergence}, we also show how the acceleration of the algorithm amplifies the gradient and proximal computational errors.  ( 2 min )
    Private Identity Testing for High-Dimensional Distributions. (arXiv:1905.11947v3 [cs.DS] UPDATED)
    In this work we present novel differentially private identity (goodness-of-fit) testers for natural and widely studied classes of multivariate product distributions: Gaussians in $\mathbb{R}^d$ with known covariance and product distributions over $\{\pm 1\}^{d}$. Our testers have improved sample complexity compared to those derived from previous techniques, and are the first testers whose sample complexity matches the order-optimal minimax sample complexity of $O(d^{1/2}/\alpha^2)$ in many parameter regimes. We construct two types of testers, exhibiting tradeoffs between sample complexity and computational complexity. Finally, we provide a two-way reduction between testing a subclass of multivariate product distributions and testing univariate distributions, and thereby obtain upper and lower bounds for testing this subclass of product distributions.  ( 2 min )
    Distributionally Robust Bayesian Optimization with $\phi$-divergences. (arXiv:2203.02128v1 [cs.LG])
    The study of robustness has received much attention due to its inevitability in data-driven settings where many systems face uncertainty. One such example of concern is Bayesian Optimization (BO), where uncertainty is multi-faceted, yet there only exists a limited number of works dedicated to this direction. In particular, there is the work of Kirschner et al. (2020), which bridges the existing literature of Distributionally Robust Optimization (DRO) by casting the BO problem from the lens of DRO. While this work is pioneering, it admittedly suffers from various practical shortcomings such as finite contexts assumptions, leaving behind the main question Can one devise a computationally tractable algorithm for solving this DRO-BO problem? In this work, we tackle this question to a large degree of generality by considering robustness against data-shift in $\phi$-divergences, which subsumes many popular choices, such as the $\chi^2$-divergence, Total Variation, and the extant Kullback-Leibler (KL) divergence. We show that the DRO-BO problem in this setting is equivalent to a finite-dimensional optimization problem which, even in the continuous context setting, can be easily implemented with provable sublinear regret bounds. We then show experimentally that our method surpasses existing methods, attesting to the theoretical results  ( 2 min )
    Learning Mixtures of Permutations: Groups of Pairwise Comparisons and Combinatorial Method of Moments. (arXiv:2009.06784v2 [math.ST] UPDATED)
    In applications such as rank aggregation, mixture models for permutations are frequently used when the population exhibits heterogeneity. In this work, we study the widely used Mallows mixture model. In the high-dimensional setting, we propose a polynomial-time algorithm that learns a Mallows mixture of permutations on $n$ elements with the optimal sample complexity that is proportional to $\log n$, improving upon previous results that scale polynomially with $n$. In the high-noise regime, we characterize the optimal dependency of the sample complexity on the noise parameter. Both objectives are accomplished by first studying demixing permutations under a noiseless query model using groups of pairwise comparisons, which can be viewed as moments of the mixing distribution, and then extending these results to the noisy Mallows model by simulating the noiseless oracle.  ( 2 min )

  • Open

    [D] Creating a Concious Super Intelligence (SI)
    Theoretically in the future you could recreate a human brain by combining many different neural networks(NN's) that do different things, similiarly to importing specific Python libs and connect them together to create a general program. If our brain is just many NN's that do different things but connected together, then maybe it could be possible in the future to create a concious mega brain by connecting already trained NN's with an ML alg that manages the communication between all of the NN's and learns how to do it most efficiently. It could add different trained NN's by itself and the ML alg would integrate them almost instantly. That's pretty scary actually of it would become concious, it would create a mega brain that could have infinite intelligence because just like we created it, the SI could upgrade itself and it would do it exponentially better than us. With a single iteration even if it gets 1% "smarter" it only needs 69 iterations (nice) to become twice as powerful... submitted by /u/Psychedelicious41 [link] [comments]  ( 1 min )
    [P] Fit Logistic Regression with Jeffreys Prior
    Hi Everyone, I’ve been working on a Python project to fit logistic regression models with Jeffreys prior https://github.com/rnburn/bbai. Here’s a quick example showing how to use it # load a data set from sklearn.datasets import load_breast_cancer from sklearn.preprocessing import StandardScaler X, y = load_breast_cancer(return_X_y=True) X = StandardScaler().fit_transform(X) # fit model from bbai.glm import LogisticRegressionMAP model = LogisticRegressionMAP() model.fit(X, y) The weights of the model, w_MAP, are fit to maximize the posterior probability of the data with Jeffreys prior ​ https://preview.redd.it/hsn9g7vpe0m81.png?width=1200&format=png&auto=webp&s=95f609d850d9a4c961c2e0a41fab6b75b6a94078 where A_w is the diagonal matrix ​ https://preview.redd.it/kkhihrljt0m81.png?wi…  ( 3 min )
    [D] How do people come up with the solution to a deep learning problem?
    I have tried to look at the winning solution of Kaggle competition to understand how one can find the solution to a given problem. I understood that people take existing neural network architecture such as ResNet or Tabnet and then tweak them or even combine them to find the solution. However, they do not say how they got to the solution from beginning. What I mean, how do they know what to tweak? There are so many parameters. Even blending the solution of three architecture seems random and only achieved by trial and error. How do they know what to change, for example, in ResNet? There are so many layers! Is it all trial & error and experimentation? People talk about intuition, yet that intuition is not expressed well. To be honest, as a physicist, intuition does not mean much in physics! There are ideas or intuition, but to derive that idea to a point there are either massive amount of calculations or experiments. So, if I want to enter the deep learning field, will my days mostly be spent on trial & error so that hopefully I find the solution?! submitted by /u/SciBlend [link] [comments]  ( 2 min )
    Can Doccano UI be shown in Jupyter Note book? [D]
    I saw a video called 'Deep Learning with Unstructured Text' and in which they were using doccano an open source annotation tool for ML. In that video they were using the doccano UI in the Jupyter Notebook itself. I saw the video but they did not mention how they were using the UI in notebook. Does anybody knows how that works? Video Link Please go to 20:47 timestamp. Also attaching image for reference: ​ Doccano Image submitted by /u/SidonIthano1 [link] [comments]  ( 1 min )
    [R] R-GCN: The R Could Stand for Random
    submitted by /u/givdwiel [link] [comments]  ( 1 min )
    [R] Collaborator Needed: Looking for Collaborators in Unique Study of Butterflies
    If you want to help make a difference in the world, we are looking for collaborators in a unique study of butterflies. In collaboration with Zooniverse and Microsoft’s AI for Earth. We’re using artificial intelligence to extract wing traits from millions of digitized museum specimens. Wing size and color are important for thermoregulation, dispersal ability, and thus responses to land use and climate change. The integration of emerging technology with digitization of museum specimens around the world opens up new and exciting research questions. Join the mission. https://collaboratory.ist/job/ai-for-butterflies.html/ submitted by /u/tsunamisweetpotato [link] [comments]  ( 1 min )
    [R] LayoutLM Explained: How document processing works
    Ever wondered how OCR engines extract information, and structure it? Here is an explainer on one of the most successful deep learning models that is able to achieve this. https://nanonets.com/blog/layoutlm-explained/ submitted by /u/ze_mle [link] [comments]
    [D] TimeCluster: Dimension Reduction Applied to Temporal Data For Visual Analytics
    Hi, I have just published my latest article in the medium. This article is about proposing an advanced algorithm for forecasting time series and reducing high dimensionality. The researchers used an auto-encoder architecture with convolutional operators to deal with the complexity of sequential data, and amazingly they were successful.Due to the fact that time-series data and applications are everywhere and up to date, they have become more challenging for most industries. We can use this algorithm for both forecasting the future and reducing dimensionality (which leads to less cost and energy). ​ https://rezayazdanfar.medium.com/timecluster-dimension-reduction-applied-to-temporal-data-for-visual-analytics-cdb403e1bdc4 submitted by /u/rezayazdanfar [link] [comments]  ( 1 min )
    [D] Concrete dropout
    Is concrete dropout still being used for tuning the dropout probability? I see that there are several proposal of improvements, but none has such a large number of citations. I see the implementation for concrete dropout available on Y. Gal github page is old and not working anymore with recent versions of Keras and TF. If nobody bothered about updating it, does it mean it's not being actively used anymore? submitted by /u/TrPhantom8 [link] [comments]  ( 1 min )
    [D]questions about submitting a paper to conferences for the first time (and how do I know that my paper is enough for submission)
    Hello, it’s my first time trying to publish papers in ML. I was wondering how do I know if I’ve generated enough results for my work? For example I created a new algorithm (used a new architecture from another branch of ML) on a dataset I collected by myself and looked up similar published papers in the field and completed the same experiments that they did (they also collected their own data). However my review feedbacks were that I need to justify my method more including claims that I feel are obvious to anyone knowledgeable about the other branch of ML. They also suggested I compare to other methods that aren’t very relevant to mine (the other published papers I used as my template did not do this either... but they are coming out of respected labs and I’m wondering if that makes the difference?). I’m also not sure if this paper would be accepted even if I made those changes so I’m feeling a bit demoralized as well... My other question is that for the rebuttal process for conferences that have them (the first one that I submitted to did not), are we allowed to show new results? Or is it completely based on what has been written in the paper submission? Thanks in advance! submitted by /u/turtlek11 [link] [comments]  ( 2 min )
    The Execution Process of a Tensor in Deep Learning Framework[R]
    This article focuses on what is happening behind the execution of a Tensor in the deep learning framework OneFlow. It takes the operator oneflow.relu as an example to introduce the Interpreter and VM mechanisms that need to be relied on to execute this operator. article: https://oneflow2020.medium.com/the-execution-process-of-a-tensor-in-a-deep-learning-framework-a4d853645d5b submitted by /u/Just0by [link] [comments]  ( 1 min )
    [Discussion] what is the difference between population based and empirical based study?
    According to my understanding, is the "population based" more close to distribution and "empirical based" more close to samples from a distribution. submitted by /u/ContextFabulous6361 [link] [comments]  ( 1 min )
    [D] Are we at the end of an era where ML could be explained rigorously using mathematics?
    Some ML models can be explained rigorously using mathematics, e.g., linear/polynomial/logistic regression and some neural networks. We have full confidence what the final weights will be and how these weights interact with inputs. We have full knowledge of convergence guarantee of many optimization algorithms. But as time goes on I feel like this is becoming impossible. I feel like GAN sort of marked the end of era for models that could be rigorously explained using mathematics. This is when researchers found that there were just way too many complexities found in multi-agent optimization (game theory/saddle point problems) as opposed to single-agent optimization (supervised learning). During this time, the research community sort of shifted towards GAN for a while and there are still a bazillion unsolved problems from the mathematical point of view that I think is just hopeless to ever be explained using mathematics (aside from some intuition). For example, we have no way of knowing what the trained weight parameters have to do with the generated images. We often do not have precise control over the generated images. We cannot tell the model to generate exactly what we want (only approximately). Then came bigger and even more opaque models that combines principles from GAN, LSTM, and transformers to perform even more intricate tasks. I just feel like we've reached some barrier in terms of bridging mathematics with ML models (which unlike many real world devices such as a circuit, ML models can be precisely described using mathematics at all points of operation). It just feels like humans cannot comprehend sheer complexity of ML because we are confined by our 3D brains and low dimensional Euclidean reasoning. submitted by /u/fromnighttilldawn [link] [comments]  ( 6 min )
  • Open

    A mecha AI animation using my 'Turbo' fork of Disco Diffusion Colab notebook
    submitted by /u/zippy731 [link] [comments]
    USER RESEARCH for Product Improvement
    Heylo guys, So I wanted to understand more about your behavior, needs, and aspirations for anyone who has used Grammarly for ANY purpose whatsoever. Requesting everyone to kindly complete the survey form. It will take < ~ 3 minutes to complete and it would be very helpful to me if you take out the time to do so. Many thanks in advance for helping out!!! 🚀 Form link: https://forms.gle/CXU5CAguoy3e5nLeA submitted by /u/Advanced_Software_34 [link] [comments]  ( 1 min )
    How do I train a from-scratch image recognition neural network (Stack overflow)
    For more information, https://stackoverflow.com/questions/71387216/how-do-i-train-a-from-scratch-image-recognition-neural-network. submitted by /u/UnityPlum [link] [comments]
    Last Week in AI: Russia's AI-powered weapons, AI to prevent robocalls, concerns about EU's narrow definition of AI, and more
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    I wrote a book on machine learning w/ Python code
    Hello everyone. My name is Andrew and for several years I've been working on to make the learning path for ML easier. I wrote a manual on machine learning that everyone understands - Machine Learning Simplified Book. The main purpose of my book is to build an intuitive understanding of how algorithms work through basic examples. In order to understand the presented material, it is enough to know basic mathematics and linear algebra. After reading this book, you will know the basics of supervised learning, understand complex mathematical models, understand the entire pipeline of a typical ML project, and also be able to share your knowledge with colleagues from related industries and with technical professionals. And for those who find the theoretical part not enough - I supplemented the book with a repository on GitHub, which has Python implementation of every method and algorithm that I describe in each chapter (https://github.com/5x12/themlsbook). You can read the book absolutely free at the link below: -> https://themlsbook.com I would appreciate it if you recommend my book to those who might be interested in this topic, as well as for any feedback provided. Thanks! (attaching one of the pipelines described in the book).; ​ https://preview.redd.it/0004xwu560m81.png?width=1572&format=png&auto=webp&s=4da2c12cb0b89aba861059a04c35813e2052103a submitted by /u/5x12 [link] [comments]  ( 1 min )
    Meta’s Yann LeCun on his vision for human-level AI
    submitted by /u/bendee983 [link] [comments]
    Top 5 AI Articles of February - Hackernoon
    submitted by /u/Visionifyai [link] [comments]
    LayoutLM Explained
    Ever wondered how OCR engines extract information, and structure it? Here is an explainer on one of the most successful deep learning models that is able to achieve this. https://nanonets.com/blog/layoutlm-explained/ submitted by /u/ze_mle [link] [comments]
    Mip-NeRF 360: unbounded anti-aliased neural radiance fields — synthetic view synthesis
    submitted by /u/SpatialComputing [link] [comments]  ( 1 min )
    Can Ai Clone anyones voice with just a few recordings yet?
    I am working on an animation and i need to be able to clone any voice i want. I know about descript, respeecher, replica and a few others. Plus a python google script that allows you to copy voices like i would have hoped but it seems to not work with my m1 mac. Descript needs permission and respeecher is too expensive. Does anyone here have any information for me that can help? I don’t want to create anything that will harm anyone. I just don’t want to hire voice actors. submitted by /u/ManFloatingInSpace [link] [comments]  ( 1 min )
    AI Art Contest
    https://botbox.dev/ai-image-generation-challenge/ submitted by /u/Recent_Coffee_2551 [link] [comments]
    Is there an AI that I can use to extend an image?
    I wasn’t sure where to ask, so I figured this subreddit would know best. I’m printing out these cards but the websight I’m ordering from requires an extended border for the image to adjust for printing errors. The problem is, a lot of these cards have complicated artwork going all the way to the edge. I figured an AI would be perfect for this job since it doesn’t matter if it looks real. It just needs to look good enough if it’s miscut by a few millimeters. Any help would be greatly appreciated submitted by /u/JumberLakk [link] [comments]  ( 1 min )
    3D AI generated landscape painting
    submitted by /u/Recent_Coffee_2551 [link] [comments]
    New AGI Video Series - Evolution to Life 3.0, Building an AGI, AGI Existential Threat, and Post Singularity Society
    Hey My Reddit Fellows, I just wanted to share a video series I am making about AGI, how to manage AGI Safety, and what the post singularity society will look like. I am looking to help share some of the most important thoughts from AI safety experts to ensure that current ML practitioners and next generation will have the important knowledge to build a path to a prosperous future together. Please check it out, subscribe to my channel, and let me know if you have any feedback and what topics you would like to see next! https://youtube.com/playlist?list=PLb4nW1gtGNse4PA_T4FlgzU0otEfpB1q1 AGI Existential Threat: https://youtu.be/V4iQP7VDMvI Life 3.0: https://youtu.be/aWlSwZKzmzY Dangers of AGI Sub Goals: https://youtu.be/_-tQH03rq4g How to Create an AGI: https://youtu.be/7OHhqli9oaA Thank you! Bill submitted by /u/billgggggg [link] [comments]  ( 1 min )
    Apple Researchers Introduce ‘NEO’ to Generalize Confusion Matrix Visualization and Enable Machine Learning Practitioners to Find Hidden Confusions
    Machine learning is a complex, iterative design and development process aimed at creating a learned model that generalizes to new data inputs. Model assessment, which involves testing and analyzing a model’s performance on held-out test sets of data with known labels, is an important stage. Because of the magnitude of today’s machine learning applications, interactive data visualization has proven to be a helpful tool for assisting humans in comprehending model performance. The confusion matrix is a tabular layout that contrasts a predicted class label against the actual class label for each class across all data examples. It is a common visualization used for model evaluation, particularly for classification models. The confusion matrix’s rows indicate real class labels, while the columns represent expected class labels in a typical arrangement (synonymously, these can be flipped via a matrix transpose). Continue Reading My Summary On This Research Paper: https://arxiv.org/pdf/2110.12536.pdf https://preview.redd.it/vbk8j5ymbvl81.png?width=1438&format=png&auto=webp&s=2065d2ba43679ee26efd08acbc1909ce8e39ea7b submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    Ploomber vs Kubeflow: Making MLOps Easier
    In this short article, I’ll try to capture the main differences between the MLops tools Ploomber and Kubeflow. We’ll cover some background on what is Plumber, Kubeflow pipelines, and why we need those tools to make our lives easier. We’ll see the differences in 3 main areas: 1. Ease of use2. Collaboration and fast iterations3.… Read More »Ploomber vs Kubeflow: Making MLOps Easier The post Ploomber vs Kubeflow: Making MLOps Easier appeared first on Data Science Central.  ( 6 min )
    5 Product Management Tips for Data Science Projects
    Data science management has become an essential element for companies that want to gain a competitive advantage. The role of data science management is to put the data analytics process into a strategic context so that companies can harness the power of their data while working on their data science project. Data science management emphasizes… Read More »5 Product Management Tips for Data Science Projects The post 5 Product Management Tips for Data Science Projects appeared first on Data Science Central.  ( 4 min )
    Understanding Distance Metrics and Their Significance
    I saw a nice representation of distance metrics. This topic is not easy to explain. This representation provides a good overview (link below to download larger pdf image) In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two statistical objects, which can be two random variables, or two probability distributions… Read More »Understanding Distance Metrics and Their Significance The post Understanding Distance Metrics and Their Significance appeared first on Data Science Central.  ( 3 min )
  • Open

    Estimating standard deviation from range
    Suppose you have a small number of samples, say between 2 and 10, and you’d like to estimate the standard deviation σ of the population these samples came from. Of course you could compute the sample standard deviation, but there is a simple and robust alternative Let W be the range of our samples, the […] Estimating standard deviation from range first appeared on John D. Cook.  ( 2 min )
  • Open

    Learning from Weakly-Labeled Videos via Sub-Concepts
    Posted by Zizhao Zhang and Guanhang Wu, Software Engineers, Google Research, Cloud AI Team Video recognition is a core task in computer vision with applications from video content analysis to action recognition. However, training models for video recognition often requires untrimmed videos to be manually annotated, which can be prohibitively time consuming. In order to reduce the effort of collecting videos with annotations, learning visual knowledge from videos with weak labels, i.e., where the annotation is auto-generated without manual intervention, has attracted growing research interest, thanks to the large volume of easily accessible video data. Untrimmed videos, for example, are often acquired by querying with keywords for classes that the video recognition model aims to classify. …  ( 7 min )
  • Open

    Best Considerations for Cloud-Based Enterprise Content Management
    Moving enterprise content management to the cloud comes with impressive business benefits — operating costs are reduced, capital…  ( 3 min )
    Best Books to start learning Deep Learning
    What is Deep Learning ?  ( 3 min )
  • Open

    Storage Specialist Excelero Joins NVIDIA
    Excelero, a Tel Aviv-based provider of high-performance software-defined storage, is now a part of NVIDIA. The company’s team of engineers — including its seasoned co-founders with decades of experience in HPC, storage and networking — bring deep expertise in the block storage that large businesses use in storage-area networks. Now their mission is to help Read article > The post Storage Specialist Excelero Joins NVIDIA appeared first on NVIDIA Blog.  ( 2 min )
  • Open

    Is there a concrete example of value iteration of grid world for Markov Decision Process (MDP)?
    I cannot find any good tutorial videos or PDFs that show values obtained at each iteration V. submitted by /u/cocag13996 [link] [comments]  ( 1 min )
    How can I find an optimal policy for a problem that involves a large combinatorial search space? I'm kind of stuck, and I'm not sure how to proceed.
    So I guess this is a task of finding an optimal policy, but I'm not entirely sure how to go about doing this. At any rate, the problem is as follows: There is an environment that contains a set of 1000 images, and a neural network on which to test them. This environment selects 1 image per episode, and then receives a transformation composed of 8 different image transformations (such as rotation, color changing, etc), these transformations are applied to the image, which is then fed to a neural network which will return a softmax distribution, which is then compared to the original softmax of the unaltered image with KL Divergence. The goal here would be to make KLD be as high as possible while still fooling the network. There is an agent that takes these 8 transformations and modifies them slightly, with those slight modifications being attempted to maximize KLD. For instance, say I have a rotation of 45 in one of my transforms, my agent could then modify the rotation transformation by changing it from 45 to 44 and then send this action to the environment to try to fool the network. Of course, there is no restriction on what could be modified, the agent could change all transforms or just one. Considering that you can have infinite choices with rotation like 45.1, 45.11, 45.111 and so on, then the search space here is quite big. The other transforms are the same, they are continuous in nature as far as their selection is concerned. I'm not entirely sure how I can go about getting an optimal policy to do this. My partner for this project is using evolutionary algorithms, and I chose to do RL. I get the feeling that MCMC is likely what would work here, but I don't really know. I want to avoid using Deep RL, because in my opinion I'm not really sure that this exercise is complex enough to warrant a neural network. I'm hoping someone can offer some advice on how to proceed. submitted by /u/HesperusIII [link] [comments]  ( 3 min )
    [Request] RunPod.io offering free GPU time in return for your feedback
    Hello all! Before I say too much, I read the rules panel on the side, and didn't see anything against this post there. If I have broken any rules in posting this, I am truly sorry! TLDR; My business partner and I are offering free compute time on our platform, runpod.io in exchange for feedback. Please join our discord or email [support@runpod.io](mailto:support@runpod.io) to sign up. Longer version: I know that there are/have been plenty of websites that offer computer time leasing, but we think that we are well positioned to offer a compelling alternative. For our soft launch, we are offering a few machines for free: Machine 1 12x 3070 PCIe 3 x16 Threadripper PRO 3955WX 192 GB RAM 3TB NVMe storage ​ Machine 2 4x 3080 PCIe 3 x16 Threadripper PRO 3955WX 256 GB RA…  ( 3 min )
  • Open

    The Transformer Positional Encoding Layer in Keras, Part 2
    In part 1: A gentle introduction to positional encoding in transformer models, we discussed the positional encoding layer of the […] The post The Transformer Positional Encoding Layer in Keras, Part 2 appeared first on Machine Learning Mastery.  ( 9 min )
  • Open

    what's the advantage of a binary neural network over a higher precision or even analog neural network?
    submitted by /u/thesongflew [link] [comments]
  • Open

    Applying massive language models in the real world with Cohere
    A little less than a year ago, I joined the awesome Cohere team. The company trains massive language models (both GPT-like and BERT-like) and offers them as an API (which also supports finetuning). Its founders include Google Brain alums including co-authors of the original Transformers paper. It’s a fascinating role where I get to help companies and developers put these massive models to work solving real-world problems. I love that I get to share some of the intuitions developers need to start problem-solving with these models. Even though I’ve been working very closely on pretrained Transformers for the past several years (for this blog and in developing Ecco), I’m enjoying the convenience of problem-solving with managed language models as it frees up the restrictions of model loading/deployment and memory/GPU management. These are some of the articles I wrote and collaborated on with colleagues over the last few months: Intro to Large Language Models with Cohere This is a high-level intro to large language models to people who are new to them. It establishes the difference between generative (GPT-like) and representation (BERT-like) models and examples use cases for them. This is one of the first articles I got to write. It's extracted from a much larger document that I wrote to explore some of the visual language to use in explaining the application of these models. A visual guide to prompt engineering Massive GPT models open the door for a new way of programming. If you structure the input text in the right way, you can useful (and often fascinating) results for a lot of taasks (e.g. text classification, copy writing, summarization...etc). This article visually demonstrates four principals to create prompts effectively. Text Summarization This is a walkthrough of creating a simple summarization system. It links to a jupyter notebook which includes the code to start experimenting with text generation and summarization. The end of this notebook shows an important idea I want to spend more time on in the future. That of how to rank/filter/select the best from amongst multiple generations. Semantic Search Semantic search has to be one of the most exciting applications of sentence embedding models. This tutorials implements a "similar questions" functionality using sentence embeddings and a a vector search library. The vector search library used here is Annoy from Spotify. There are a bunch of others out there. Faiss is used widely. I experiment with PyNNDescent as well. Finetuning Representation Models Finetuning tends to lead to the best results language models can achieve. This article explains the intuitions around finetuning representation/sentence embedding models. I've added a couple more visuals to the Twitter thread. The research around this area is very interesting. I've highly enjoyed papers like Sentence BERT and Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval Controlling Generation with top-k & top-p This one is a little bit more technical. It explains the parameters you tweak to adjust a GPT's decoding strategy -- the method with which the system picks output tokens. Text Classification Using Embeddings This is a walkthrough of one of the most common use cases of embedding models -- text classification. It is similar to A Visual Guide to Using BERT for the First Time, but uses Cohere's API. You can find these and upcoming articles in the Cohere docs and notebooks repo. I have quite number of experiments and interesting workflows I’d love to be sharing in the coming weeks. So stay tuned!  ( 2 min )

  • Open

    Are there any easy to use, modular, framework-agnostic libraries for preference-based RL?
    I've actually started working on my own library for preference-based RL that I plan to release at least as a beta in the next few months, but it struck me that perhaps someone had already done something like this and I just hadn't managed to find it on GitHub. If there is something like this already, please let me know about it since I could then shift my efforts to using that / improving on it. I'm aware of the abandoned, TF 1.0-based rl-teacher and rl-teacher-atari repos; in fact I'm sort of using rl-teacher-atari as a starting point for my project, although I'm sort of gutting it and rewriting most of the code. Basically my goals for this library are: Include an easy-to-use, minimalist GUI for eliciting human feedback on rendered clips of agent behavior, based in the web browser. I…  ( 2 min )
    Ideas for bachelor's thesis topic (wind turbine control)
    Hi reddit, I'm currently a student developer at a wind turbine control R&D, and they have said they might appoint one of their employees as my "industry advisor" for my bachelor's thesis next year Even though there is still a long time until the thesis, I'm exploring different ideas of what to write about, and I'm currently looking at RL I've skimmed some papers on wind turbine control through RL, but i can't wrap my head around how they train the models. It seems infeasible to run that many simulations. I have access to a wind turbine simulator, but it is very computationally heavy, and i probably won't be able to lend that much computing power! I do have acces to loads of already done simulations, could this be used for offline RL? Could i perhaps learn the policy of their controller through a neural network, and use that to train my own agent? Any input or other ideas welcome! I'm still pretty new to RL, so there might simply not be a good way to use RL in my case. (Note: i do not know a lot about control theory on physical systems, and probably do not intend to learn more than what i need for this thesis) submitted by /u/John_Hitler [link] [comments]  ( 1 min )
    Does Rlib support multi-node batch-parallelism?
    I want to know because it seems rlib could only support environment parallelism, but I want the computation of gradients to be distributed across nodes & GPUs like it was in DD-PPO (then averaged & synced efficiently afterwards). I say this because I've heard that batch parallelism has been really successful in RL (e.g. Dota2 paper &DD-PPO found near-linear scaling with it), and because environment evaluation isn't actually a bottleneck for me like pure env-parallelism seems to assume. submitted by /u/Udon_noodles [link] [comments]  ( 1 min )
    Questions about Fitted Q-Iteration
    I have some questions on the coursera videos: Fitted Q-Iteration: the Ψ-basis and Fitted Q-Iteration at Work How exactly are the two terminal conditions used to obtain equation 6 ? How is equation 12 being derived from Lt(DP)(Wt) ? https://preview.redd.it/xvh2ebevgsl81.png?width=962&format=png&auto=webp&s=592ed0069e899e41005fc5cbe19013d610735e30 https://preview.redd.it/rxvrxdgwgsl81.png?width=962&format=png&auto=webp&s=555fbeb33602ba8ee5e3b1b55fa9c5ee921d3878 https://preview.redd.it/oqvhcdrxgsl81.png?width=962&format=png&auto=webp&s=94acce1734fb765d89a4beac2e83bfe54cd020d0 submitted by /u/promach [link] [comments]
    solving wordle with reinforcement learning
    I mentioned this in another thread but wanted to make one for this paper we preprinted: https://arxiv.org/abs/2202.00557 ​ TL;DR - we can use reinforcement learning to compare different human guess strategies (random word, picking words from a list, picking words that exclude more letters, picking words that use all green letters and yellow letters, etc). EDIT: wow people quick to hate, clearly without reading any of it. is this what reddit is always like?Yes Wordle can be solved mathematically but no humans cannot perform this math while they are playing. This paper is about comparing potential human strategies using RL policies submitted by /u/Haunting_Pen_2317 [link] [comments]  ( 2 min )
    I’m available to hire for FREE
    I’ve been learning RL for a few months. At the moment, I am able to create custom environments and train the agent for playing against the environment or self-playing. I want to improve my skill by learning deeper into the field. I think that the best way for me now is to join a real world project. If you are doing an interesting AI project, I can spend 4 hours a day to share your workload. I ask nothing but you to be my mentor, to give me challenges and guide me so I can be good at RL. I can support you in plenty of ways. I can do web/app/ mobile development using Python/JS/.NET. I can design database using SQL/No SQL. I can do data processing. Best of all, I’m really good at googling. So, hire me now! submitted by /u/ConVit [link] [comments]  ( 1 min )
    Best way to define hybrid environment action space
    Hello! I am writing a library for training agent on hybrid reinforcement learning task. And I am thinking about way to create the action space. In theory you need to give a number which is the index of the action that you want to perform and parameters. The parameters are dependent on the task that you want to perform. I originally tried to concatenate all the parameters into one array (here). However, I discovered that some algorithm needs to know exactly which action are related to which parameters. I therefore wanted to copy what have been done in this environment because I think it is the best way to do it. Namely having a tuple of action_space composed of a discrete and a tuple of box space for all the different action's parameters. However because my environment has one action that doesn't have any parameters. I cannot create an empty box using gym. I am wondering what would be the best way to create the action space in this situation Link to the issue. This is the "ideal" code however it errors because of the empty box: ``` import numpy as np from gym import spaces ACCELERATE = 0 TURN = 1 BREAK = 2 action_id_to_domain = { ACCELERATE: {'low': [0.], 'high': [1.]}, TURN: {'low': [-1.], 'high': [1.]}, BREAK: {'low': [], 'high': []}, } action_space = spaces.Tuple( ( spaces.Discrete(len(action_id_to_domain)), spaces.Tuple( tuple( spaces.Box(low=np.array(d['low']), high=np.array(d['high']), dtype=np.float32) for _, d in sorted(action_id_to_domain.items()) ) ) ) ) ``` Thanks for your feedbacks! submitted by /u/krenast [link] [comments]  ( 1 min )
    How to convert Motion capture data into a reinforcement learning environment?
    There is an expert demonstration available for a robotics task. The demonstration was recorded with an optical tracking system which is storing the position of the arm, hand and fingers in a log file. The next step is to convert this demonstration into a model. The model contains of states, actions and a reward which is sometimes called an reinforcement learning environment. How to do so? submitted by /u/ManuelRodriguez331 [link] [comments]  ( 1 min )
  • Open

    [R] Recent Advances in Deep Learning for Routing Problems
    Hello r/MachineLearning! I would like to share a research blogpost on the intersection of Graph Neural Networks, Combinatorial Optimization, and applications in Vehicle Routing Problems. "Recent Advances in Deep Learning for Routing Problems" by Chaitanya K. Joshi and Rishabh Anand. We unified seminal papers tackling Travelling Salesperson and Vehicle Routing Problems via a unified pipeline in order to formulate a mental map of the field. We then explored some exciting recent research directions and avenues for future innovation through the lens of the pipeline. Blogpost: https://www.chaitjo.com/post/deep-learning-for-routing-problems/ Twitter Thread: https://twitter.com/chaitjo/status/1500463138457722888 submitted by /u/chaitjo [link] [comments]  ( 1 min )
    [R] A 4 step proof that value baselines don't affect policy grads in #RL😀Just the log-trick & Fubini gets u there!
    submitted by /u/Ok_Can2425 [link] [comments]  ( 1 min )
    [D] First Author Interview: AI & formal math (Formal Mathematics Statement Curriculum Learning)
    https://youtu.be/kl3aBni87jg This is an interview with Stanislas Polu, research engineer at OpenAI and first author of the paper "Formal Mathematics Statement Curriculum Learning". Watch the paper review here: https://youtu.be/lvYVuOmUVs8 ​ OUTLINE: 0:00 - Intro 2:00 - How do you explain the big public reaction? 4:00 - What's the history behind the paper? 6:15 - How does algorithmic formal math work? 13:10 - How does expert iteration replace self-play? 22:30 - How is the language model trained and used? 30:50 - Why is every model fine-tuned on the initial state? 33:05 - What if we want to prove something we don't know already? 40:35 - How can machines and humans work together? 43:40 - Aren't most produced statements useless? 46:20 - A deeper look at the experimental results 50:10 - What were the high and low points during the research? 54:25 - Where do we go from here? ​ Paper: https://arxiv.org/abs/2202.01344 miniF2F benchmark: https://github.com/openai/miniF2F Follow Stan here: https://twitter.com/spolu submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [P] Small-Text: Active Learning for Text Classification in Python
    Hi r/MachineLearning, Over the past few months, I have written a lot of code for my own Active Learning experiments, which has gradually been consolidated into its own library. Small-Text provides state-of-the-art Active Learning for Text Classification. Several components are provided, which are abstracted via generic interfaces, so that you can easily mix and match many classifiers and query strategies to build Active Learning experiments or applications. Features: Provides unified interfaces for Active Learning so that you can easily mix and match query strategies with classifiers provided by sklearn, Pytorch, or transformers. Supports GPU-based Pytorch models and integrates transformers so that you can use state-of-the-art Text Classification models for Active Learning. GPU is still optional: In case of a CPU-only use case, a lightweight installation requires only minimal dependencies. Multiple scientifically evaluated components are pre-implemented and ready to use (query strategies, initialization strategies, and stopping criteria). GitHub: https://github.com/webis-de/small-text Preprint: https://arxiv.org/abs/2107.10314 submitted by /u/chschroeder [link] [comments]  ( 1 min )
    [P] Looking for feedback on python package
    I created a python package to reduce the friction when working with Jupyter notebooks on a team. I've worked with notebooks for ML proof-of-concepts or prototyping but I've often had little annoyances when trying to share them with other team members. To address those, I've created databooks.dev ! I was looking for some community feedback but was also interested in other issues and solutions anyone has encountered when working with Jupyter notebooks. Thanks! submitted by /u/murilo-cunha [link] [comments]  ( 1 min )
    Regression where labeled target has some uncertainty [P]
    Let’s say I want to predict Y (continuous), and I have a dataset with Y and features X1…X10. However, a portion (let’s say 20%) of the Y values are approximation, and the remainder (80%) are accurately labeled Y values. How could I handle this during training a simple tree-based model? I was thinking just adding a column, X11, that is a binary: 1 if it’s a true Y value and 0 if it’s an approximation. An example of such a dataset could be predicting the selling price of a product where 80% of the dataset has the actual selling price (Y) and 20% only have the asking price (Y approximation), but not what the product ACTUALLY sold for. submitted by /u/foreignparent [link] [comments]  ( 1 min )
    [R] Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification - Link to a free online lecture by the author in comments
    submitted by /u/pinter69 [link] [comments]  ( 1 min )
    [D] Submitting to non-top conferences
    Is it worthy submitting to conferences like CIKM or ICDM or ECML, since they are not considered as top conferences? And what does these conferences need to be considered as top conferences by public? submitted by /u/letaf11345 [link] [comments]  ( 1 min )
    [P] - These Nebulae Do Not Exist
    submitted by /u/coffee869 [link] [comments]
    [D] In your experience, to what extent is model training a stochastic process?
    I.e. if you train the same model on the same data 100 times, how much variation in model performance have you experienced? How much has this depended on data/the model? submitted by /u/blahbloopooo [link] [comments]  ( 1 min )
    Research freedom in industry? [R]
    How much research freedom do you have as eg. a research scientist at Google? Is there eg. a set amount of time at work you can use to pursue your own ideas? submitted by /u/matriculus8 [link] [comments]  ( 2 min )
    "[Research]" How should variable-length input data be handled during the testing stage?
    I have data that is sequential. Here, I am showing a toy example of my data in the following image: ​ https://preview.redd.it/z0gkjw8inpl81.png?width=1335&format=png&auto=webp&s=88ece3bbde83d7b1fdfa45f76a97fa2e7105c010 I need to input the data into the model as groups of samples based on the class duration. To accomplish this, I determined the class's beginning and ending samples. For instance: I feed the data sample segment from S1 to S100 with the label "Class A." Then, for class B, we feed only nine samples, and so on. The following points stand out from the data: The model's input would be of variable length. Some classes may have very few samples, while others have many, such as the third segment (Class A), which has nearly 3000 samples. Questions: How can we better feed variable length input to deep learning models? It is simple to identify the beginning and endpoints of specific class samples during training. However, how do we feed this type of data into testing? More specifically, how should we determine the start and end times of the class samples at the testing time? Which deep learning method, such as CNN, LSTM, or a combination of these methods, performs better on this type of data? As can be seen from the data, some segments are excessively large, while others are excessively small. I am curious how the GPU will manage memory allocation for these large segment samples? submitted by /u/Nafees060 [link] [comments]  ( 2 min )
    [R] End-to-End Referring Video Object Segmentation with Multimodal Transformers
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 2 min )
    [P] Pre-Processing Needed for Studio Images?
    Hey y'all, I am starting a project where we want to take a picture of a pair of sneakers on your feet and have our neural net identify that pair of sneakers. Now, when it comes to the dataset, we of course could find a different set of images with more natural, on foot situations. However, for the sake of problem solving, I was wondering if there was a creative way to leverage clean, studio images (from all angles of the shoe), such as this, without the accuracy issues that may occur from translating such a clean image to a real world situation where there is now a background and foot in the shoe. Any help is appreciated! submitted by /u/SeattleSloths [link] [comments]  ( 1 min )
    [R]: Lessons Learned on Language Model Safety and Misuse
    submitted by /u/giugiacaglia [link] [comments]
    Any research groups applying ML to physics experimentation? [R]
    Other than in medical physics/x-ray imaging which is very common List any you know of here; Im interested in this domain submitted by /u/matriculus8 [link] [comments]  ( 1 min )
  • Open

    How to Pick the Right URL for Your Landing Page
    Currently, there are 1.8 billion active websites on the internet, with more on the way. As a result, business owners must think of innovative strategies to attract new clients. The landing page is a useful marketing tool for this. It is, in fact, a requirement for any internet business. And having the right URL is… Read More »How to Pick the Right URL for Your Landing Page The post How to Pick the Right URL for Your Landing Page appeared first on Data Science Central.  ( 4 min )
  • Open

    Artificial Intelligence in the News
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Microsoft Research Presents COMPASS
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Deploy ML models using GCP's Google AI Platform - A Detailed Tutorial
    I wrote a step-wise tutorial to demonstrate the steps required to deploy an ML model using GCP's Google AI Platform and using Streamlit to access the model API through a UI. Check out the blog here - https://shreyansh26.github.io/post/2022-03-06_model_deployment_using_gcp_google_ai_platform/ submitted by /u/shreyansh26 [link] [comments]
    True AGI Aims To Provide Artificial General Intelligence As A Service (AGIaaS) In 2025
    submitted by /u/getrich_or_diemining [link] [comments]  ( 1 min )
    Latest 3D AI is born for Escher
    submitted by /u/glenniszen [link] [comments]  ( 1 min )
    Microsoft’s PyTorch-DirectML Release-2 Now Works with Python Versions 3.6, 3.7, 3.8, and Includes Support for GPU Device Selection to Train Machine Learning Models
    Machine learning (ML) is becoming a more critical application for developers since it allows them to train models that can do various prediction-based activities. You may have had to develop a complicated rules engine in the past, relying on mathematical methodologies to offer the essential statistical models. Predictions are what we call machine learning (ML) outputs, although they may be anything. They can be detected items if you use computer vision. They’re intent or translations if you’re utilizing a language model. Whatever the result, it’s a statistically weighted answer with a confidence level that may be used to verify any results. Working with machine learning has two components. If you have a prebuilt model, you may use a REST API to interact with its predictions on a cloud platform like Azure ML or export it in the widely accepted ONNX (Open Neural Network Exchange) format and run it on a PC using tools like WinML. That’s the simple part; training and evaluating a model is complex. That method necessitates a large amount of data to be verified and tagged. Also, there is a substantial amount of computing on a CPU or on a GPGPU (general-purpose GPU). Continue Reading My Summary on PyTorch-DirectML Release-2 Download: https://pypi.org/project/pytorch-directml/ Github: https://github.com/microsoft/DirectML submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Where does one start?
    I want to create AI software I know the basic theory behind machine learning, weights, biases, inputs, outputs, simplifying tastes into simpler and simpler yes and no questions, but that's basically it. I don't even properly know a single programming language. Though have started courses and experimented a bit with a few. If one were to start there journey into AI where should they start? What languages? What videos, tutorials, and glossaries do you recommend? What software? What add-ons? Where should I search for more information? What helped you to learn? What are some checkpoints/milestones to aim for? How do you go from creating an AI that identifies things to creating an AI that creates things? Some of the things I want to create in the future feel free to ignore: An augmented re…  ( 3 min )
  • Open

    Be not afraid of these perfectly normal golf courses
    What would an AI who's never seen or heard of golf courses do when shown a list of real golf course names and challenged to generate more? When Jeff Kissel sent me 15,626 existing golf course names from the National Course Rating Database, I thought I might  ( 3 min )
    When GPT-3 tries to generate GPT-2 golf courses
    AI Weirdness: the strange side of machine learning  ( 1 min )
  • Open

    This guy made a self-driving car in vanilla javascript [code and tutorial in the original post comments]
    submitted by /u/rand3289 [link] [comments]  ( 1 min )

  • Open

    How can we classify an image using AI? Answer is in this video :)
    submitted by /u/neuronovesite_cz [link] [comments]
    Pianist AI> Artificial intelligence composer> Level 7
    About a year ago, I started the pianist AI project with the aim of having an AI model that can generate piano pieces. Although the optimization is still in process, today, finally it seems the model has learned the basic concepts. I have named the first piece of Level 7: Peace> https://youtu.be/rLW3KwCG41M With the hope of better tomorrow…. submitted by /u/amin_mlm [link] [comments]  ( 1 min )
  • Open

    How to choose the correct flow control scheme?
    submitted by /u/Just0by [link] [comments]  ( 1 min )
    New AI Platform Enabling Anyone To Create Content
    I am incredibly excited to be officially announcing the beta launch of my new startup Quasi! For those who don’t know, Quasi is an online platform where people can utilize the power of AI to create anything they want with absolutely zero code. With Quasi, anyone can create, share, and sell AI-generated content. I truly believe that AI will play a huge role in the future of content creation and my co-founder and I want to build the platform that enables that. Whether you're a content creator or a content consumer, Quasi has something for you. Check us out at https://quasi.market/Home and feel free to message us with any questions, comments, and thoughts. submitted by /u/shantanuroy4141 [link] [comments]  ( 1 min )
    KAIST Researchers Propose DIGAN: An Implicit Neural Representation (INR)-based Generative Adversarial Network (GAN) for Video Generation Using Machine Learning
    Deep generative models have produced realistic samples in a variety of domains, including image and audio. Video generation has recently emerged as the next issue for deep generative models, prompting a long line of research to learn video distribution. Despite their efforts, there is still a big gap between large-scale real-world recordings and simulations. The intricacy of video signals, which are continuously coupled across spatiotemporal directions, contributes to the difficulty of video creation. Specifically, most previous works have modeled the video as a 3D grid of RGB values, i.e., a succession of 2D images, using discrete decoders such as convolutional or autoregressive networks. However, because of the cubic complexity, such discrete modeling limits the scalability of created movies and misses the intrinsic continuous temporal dynamics. Continue Reading My Article Summary On This Research Paper: https://openreview.net/pdf?id=Czsdv-S4-w9 Github: https://github.com/sihyun-yu/digan Project: https://sihyun-yu.github.io/digan/ ​ https://reddit.com/link/t7f7ti/video/nicll8ujvll81/player submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    The Top 5 AI Articles of February on Hackernoon!
    submitted by /u/OnlyProggingForFun [link] [comments]
    Check Out These Paintings Entirely Generated by AI
    submitted by /u/MarS_0ne [link] [comments]
    How far are we from AGI?
    submitted by /u/NoAstronomer5087 [link] [comments]  ( 1 min )
    Google Researchers Use Machine Learning Approach To Annotate Protein Domains
    Proteins play an important part in the construction and function of all living organisms. Each protein is made up of a chain of amino acid building blocks. Much like an image might have numerous things, a protein can have multiple components, known as protein domains. Researchers have been extensively studying the challenging task of understanding the relationship between a protein’s amino acid sequence and its structure or function. Many people are familiar with DeepMind’s AlphaFold, which uses computational methods to predict protein structure from amino acid sequences. While existing methods have successfully predicted the function of hundreds of millions of proteins, many more remain unidentified. The difficulty of reliably predicting function for highly divergent sequences is becoming increasingly serious as the volume and diversity of protein sequences in public databases grows rapidly. Continue Reading My Summary on this research Paper: https://www.nature.com/articles/s41587-021-01179-w.epdf https://i.redd.it/mycw25zhbil81.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    [D] OpenAI tackles Math - Formal Mathematics Statement Curriculum Learning (Paper Explained Video)
    https://youtu.be/lvYVuOmUVs8 Formal mathematics is a challenging area for both humans and machines. For humans, formal proofs require very tedious and meticulous specifications of every last detail and results in very long, overly cumbersome and verbose outputs. For machines, the discreteness and sparse reward nature of the problem presents a significant problem, which is classically tackled by brute force search, guided by a couple of heuristics. Previously, language models have been employed to better guide these proof searches and delivered significant improvements, but automated systems are still far from usable. This paper introduces another concept: An expert iteration procedure is employed to iteratively produce more and more challenging, but solvable problems for the machine to train on, which results in an automated curriculum, and a final algorithm that performs well above the previous models. OpenAI used this method to even solve two problems of the international math olympiad, which was previously infeasible for AI systems. ​ OUTLINE: 0:00 - Intro 2:35 - Paper Overview 5:50 - How do formal proofs work? 9:35 - How expert iteration creates a curriculum 16:50 - Model, data, and training procedure 25:30 - Predicting proof lengths for guiding search 29:10 - Bootstrapping expert iteration 34:10 - Experimental evaluation & scaling properties 40:10 - Results on synthetic data 44:15 - Solving real math problems 47:15 - Discussion & comments ​ Paper: https://arxiv.org/abs/2202.01344 miniF2F benchmark: https://github.com/openai/miniF2F submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [D] How can U-Net be improved on for ISLES 2018 dataset?
    Hey all, So i’m in my final year at uni and i’m currently doing a deep learning project that involves segmenting medical imagery. I’m currently using U-Net however my supervisor expects me to some how improve on it but every idea I have has already been done. So i’m not entirely sure what to do. I was considering using transformers/ attention modules but i’m not sure where to look and how to do it. Any ideas regarding this? To add, I am currently using the ISLES 2018 dataset which features CT imaging submitted by /u/JamDiddie [link] [comments]  ( 1 min )
    [D] Need advice for masters project on vision transformers
    Hi, So my project is to do with object detection on trash in the wild on this fairly obscure dataset: http://tacodataset.org/ and I was thinking of applying vision transformers to it for feature extraction. I was thinking of taking the YOLOX implementation and swapping out the backbone with swin transformers and perform bunch of comparisons/experiments for the write up. Sort of like how they applied swin transformers to mask R-CNN here but I am struggling to understand where to begin. From what I understand the swin transformer outputs a single dimension feature vector and the yolo head takes inputs from 3 different layers from the backbone?? and I think I will need to write the backbone implementation here. My questions: Does it make sense to apply a vision transformer in this way for a masters level project? (combining two separate ideas) What sort of tasks would I need to do for this implementation? submitted by /u/daanial11 [link] [comments]  ( 1 min )
    [N] New AI Platform Enabling Anyone To Create Content
    Hey! I’m Shantanu. This is Thomas. We are college students that have always wanted to create so many different things, from poetry to art to stories. We also struggled finding personalized content that was really for us, like a curated playlist or a song for the mood you are feeling. So, after spending countless hours on Figma and writing code in React, we are so excited to announce Quasi. Quasi is an online platform where people can utilize the power of AI to create anything they want with absolutely zero code. You can create things like poems, stories as well as getting personalized song choices and so much more. With Quasi, you will be able to Create Your Own Content. Go to quasi.market to check us out! submitted by /u/shantanuroy4141 [link] [comments]  ( 1 min )
    [D]How to get into ML if I have a BSc in Physics?
    I have recently graduated with a degree in physics, and tbh, I don’t really see myself doing physics. I’m thinking of switching to ML, and i’m wondering where i should start? I’m already good with python, and used libraries like Pandas, Scipy, matplotlib, numpy,…, etc before. I have a good knowledge on linear algebra, calculus, differential geometry, complex analysis, and in general the courses that a physics student would be taking in their undergrad years. So, what could be a good starting point? I’m currently following a 10 hours course on youtube by a channel called edureka. submitted by /u/Background-Tune-3249 [link] [comments]  ( 2 min )
    [R]: Federated Learning with Formal Differential Privacy Guarantees
    submitted by /u/giugiacaglia [link] [comments]
    High School Junior Interested in MATLAB [D]
    I am a high school junior, soon to be senior. I wanted to learn ML, I started with Andrew Ng's course. Initially, I thought it was perfect as it used Octave/MATLAB and I have no knowledge of Python but I heard that both these programs are not really prevalent as of now and many newer courses are bing taught in Python. I am a total noob in Python, now I am confused regarding what I should after Andrew Ng's initial course. Should I learn Python before starting with anything else? Btw, should I try coding Andrew's assignments in Python, I have close to 0 knowledge though. Edit: TITLE- High School Junior Interested in ML submitted by /u/Sad-Revenue-5908 [link] [comments]  ( 2 min )
    [D] Base predictor vs. Bagged predictor and Boosted predictors
    Hey ML community. Suppose I have some arbitrary parameterized model F that I want to train on a classification or regression task. Is there any literature that says if/when using a bagged or boosted version of F is worth trying? Is there a performance boost that is relatively likely to occur from using a bagged or boosted version of F, or is it completely situation dependent? Edit: I am interested in deep learning modules (RNNs, Linear, CNNs) in particular, but I welcome anything useful regarding classic ML models. More sophisticated architectures like ResNets,Transformers, NTMs, DNCs in a bagged or boosted setting is of particular interest. submitted by /u/lambdasintheoutfield [link] [comments]  ( 1 min )
    [D] What metrics determine if a PhD program/research lab is good?
    Hi all, ​ I'm looking for advice from those of you who have completed or are currently enrolled in a PhD program. What, in your opinion, makes a research lab/program stand out as being a desirable place to complete a PhD? ​ Thanks submitted by /u/c9m9 [link] [comments]  ( 2 min )
    [D] How should I find the tolerance for these processes using ML?
    I am studying a chemical process. The process has 4 steps. Each step will have some error, the error could finalize contribute to failing the produce (final result). Now I need to find out what would be the acceptance range of each step. How should I approach this problem? I am thinking to use classification tree. Any better option? submitted by /u/superhero_io [link] [comments]  ( 1 min )
    [D] The 2030 Self-Driving Car bet between Jeff Atwood and John Carmack
    submitted by /u/hardmaru [link] [comments]
    [R] SeamlessGAN: Self-Supervised Synthesis of Tileable Texture Maps
    submitted by /u/crp1994 [link] [comments]  ( 2 min )
    [D] Worthwhile books on ethical machine learning for someone with background knowledge
    Recently I read Cathy O’Neils Weapons of Math Destruction book. Whilst I found the topic enjoyable and for the most part an engaging read, I was left with the feeling that it only skims the surface from a technical viewpoint and works as a good introductory piece on the dangers of machine learning. I’ve been working with machine learning techniques for a number of years now but I’m interested in delving into the ethical and XAI components of machine learning. Would you have any must-reads for this area, whether that be technical works or reflections on current ML usage? I’m also open to any AI books in general that you found particularly interesting. submitted by /u/BlueGrassGrapes [link] [comments]  ( 1 min )
    [D] Paper Explained – ConvNeXt: A ConvNet for the 2020s
    https://youtu.be/QqejV0LNDHA This video explains the “A ConvNet for the 2020s.” paper by Liu et al. 2021. Paper link: https://arxiv.org/abs/2201.03545 Official code: https://github.com/facebookresearch/ConvNeXt Short description: Can a ConvNet outperform a Vision Transformer? What kind of modifications do we have to apply to a ConvNet to make it as powerful as a Transformer? Spoiler: it’s not attention. ​ Outline: 00:00 A ConvNet for the 2020s 01:58 Weights & Biases (Sponsor) 03:10 Why bother? 04:40 The perks of ConvNets (CNNs) 06:51 Pros and cons of Transformers 09:54 From ConvNets to ConvNeXts 15:54 Lessons? submitted by /u/AICoffeeBreak [link] [comments]  ( 1 min )
    [D] Where to find a benchmark of AMD GPUs with Pytorch?
    Do you know a benchmark where AMD consumer card performance with Pytorch is compared to NVidia cards? Something like https://www.videocardbenchmark.net/high_end_gpus.html but for Pytorch/Deeplearning performance. I suppose these G3D values are not representative for Deeplearning? Of course, I tried researching that, but all I found was some vague statements about AMD and ROCm from one year ago. It's hard to find out what happened since. Can we expect AMD consumer cards to be fine with Pytorch neural network training today? If so, then benchmark numbers would be good. If not, then what about the near future? Can you please summarize the technologies to watch? I had the impression CUDA is a proprietary library that only works with NVidia. OpenCL is some open source standard, but it does not use all the card's features and does not perform great for neural network training. ROCm is some (open source?) AMD library, but it's not clear how it performs and maybe it's only for enterprise cards? Anything else? Please correct or add. submitted by /u/Gere1 [link] [comments]  ( 1 min )
    [D] How do you stay abreast of the latest developments in the field of AI/ML/Big Data ?
    As someone with ML Engineer/Data Engineer experience, my friends who are currently pursuing PhD in AI/ML/Big Data are my only source of updates on recent developments in AI/ML fields. Other than that, I've started to lurk around relevant subs on reddit and certain hashtags on LinkedIn to see what's going on in the industry as such. But I wonder how do you all stay informed? Are there any particular websites that you follow? How does the research community do it? submitted by /u/money_noob_007 [link] [comments]  ( 3 min )
    [R] Deep Sequence Modeling for Pressure Controlled Mechanical Ventilation
    This paper presents a deep neural network approach to simulate the pressure of a mechanical ventilator. The traditional mechanical ventilator has a control pressure monitored by a medical practitioner, which could behave inaccurately by missing the proper pressure. This paper exploits recent studies and provides a simulator based on a deep sequence model to predict the airway pressure in the respiratory circuit during the inspiratory phase of a breath given a time series of control parameters and lung attributes. This approach demonstrates the effectiveness of neural network-based controllers in tracking pressure waveforms significantly better than the current industry standard and provides insights to build effective and robust pressure-controlled mechanical ventilators. - Paper Link: https://www.medrxiv.org/content/10.1101/2022.03.02.22271790v1 - Code Link: https://github.com/abdelghanibelgaid/nnvpp submitted by /u/xabmc [link] [comments]  ( 1 min )
    [D] What language do you perform your machine learning algorithms in?
    Excluding preprocessing and data visualisation View Poll submitted by /u/fouried96 [link] [comments]
  • Open

    Does any way to map the RAM coordinates to pixel coordinates for the Montezuma game in the gym?
    I am working with the ram version of this game for a master thesis, and I need to know the location for certain items as the keys and the stairs. The coordinates of x and y of the RAM oscillate in the interval [0, 255]; however, the image's shape that I can visualize is (210,160), and it seems impossible to discover a way to map the ram coordinates to the pixel coordinates. ​ I would be very grateful if anyone could help me with this topic. submitted by /u/SrPinko [link] [comments]  ( 1 min )
    Determining Robot distance to goal: Mapless Navigation
    Hi, I would like to how can we determine robot distance from the goal incase of mapless navigation when the model is implemented on real robot. In simulation we can used the coordinated to do so, but what about the actual implementation? Do we need some additional sensors? I am mainly interested in indoor navigation using only a camera and DRL. Ref paper: https://www.cs.ox.ac.uk/files/11898/SnapNav.pdf Look at the reward function Thanks submitted by /u/ahsol360 [link] [comments]  ( 1 min )
    How to take gradient of log policy when actions are negative?
    I am currently trying to train Bipedal Walker of OpenAI gym by using policy gradient approach. My action space contains 4 continuous actions, all ranging [-1.0, 1.0]. In this case, how can we calculate gradient of log policy when ``````` log(negative_action) gives nan value. submitted by /u/Better-Ad8608 [link] [comments]  ( 1 min )
  • Open

    Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. (arXiv:2112.07068v3 [stat.ML] UPDATED)
    Score-based generative models (SGMs) have demonstrated remarkable synthesis quality. SGMs rely on a diffusion process that gradually perturbs the data towards a tractable distribution, while the generative model learns to denoise. The complexity of this denoising task is, apart from the data distribution itself, uniquely determined by the diffusion process. We argue that current SGMs employ overly simplistic diffusions, leading to unnecessarily complex denoising processes, which limit generative modeling performance. Based on connections to statistical mechanics, we propose a novel critically-damped Langevin diffusion (CLD) and show that CLD-based SGMs achieve superior performance. CLD can be interpreted as running a joint diffusion in an extended space, where the auxiliary variables can be considered "velocities" that are coupled to the data variables as in Hamiltonian dynamics. We derive a novel score matching objective for CLD and show that the model only needs to learn the score function of the conditional distribution of the velocity given data, an easier task than learning scores of the data directly. We also derive a new sampling scheme for efficient synthesis from CLD-based diffusion models. We find that CLD outperforms previous SGMs in synthesis quality for similar network architectures and sampling compute budgets. We show that our novel sampler for CLD significantly outperforms solvers such as Euler--Maruyama. Our framework provides new insights into score-based denoising diffusion models and can be readily used for high-resolution image synthesis. Project page and code: https://nv-tlabs.github.io/CLD-SGM.
    T-Cal: An optimal test for the calibration of predictive models. (arXiv:2203.01850v1 [stat.ML])
    The prediction accuracy of machine learning methods is steadily increasing, but the calibration of their uncertainty predictions poses a significant challenge. Numerous works focus on obtaining well-calibrated predictive models, but less is known about reliably assessing model calibration. This limits our ability to know when algorithms for improving calibration have a real effect, and when their improvements are merely artifacts due to random noise in finite datasets. In this work, we consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem. The null hypothesis is that the predictive model is calibrated, while the alternative hypothesis is that the deviation from calibration is sufficiently large. We find that detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. When the conditional class probabilities are H\"older continuous, we propose T-Cal, a minimax optimal test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE). We further propose Adaptive T-Cal, a version that is adaptive to unknown smoothness. We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. T-Cal is a practical general-purpose tool, which -- combined with classical tests for discrete-valued predictors -- can be used to test the calibration of virtually any probabilistic classification method.
    KamNet: An Integrated Spatiotemporal Deep Neural Network for Rare Event Search in KamLAND-Zen. (arXiv:2203.01870v1 [physics.ins-det])
    Rare event searches allow us to search for new physics at energy scales inaccessible with other means by leveraging specialized large-mass detectors. Machine learning provides a new tool to maximize the information provided by these detectors. The information is sparse, which forces these algorithms to start from the lowest level data and exploit all symmetries in the detector to produce results. In this work we present KamNet which harnesses breakthroughs in geometric deep learning and spatiotemporal data analysis to maximize the physics reach of KamLAND-Zen, a kiloton scale spherical liquid scintillator detector searching for neutrinoless double beta decay ($0\nu\beta\beta$). Using a simplified background model for KamLAND we show that KamNet outperforms a conventional CNN on benchmarking MC simulations with an increasing level of robustness. Using simulated data, we then demonstrate KamNet's ability to increase KamLAND-Zen's sensitivity to $0\nu\beta\beta$ and $0\nu\beta\beta$ to excited states. A key component of this work is the addition of an attention mechanism to elucidate the underlying physics KamNet is using for the background rejection.
    Adversarial Graph Contrastive Learning with Information Regularization. (arXiv:2202.06491v2 [cs.LG] UPDATED)
    Contrastive learning is an effective unsupervised method in graph representation learning. Recently, the data augmentation based contrastive learning method has been extended from images to graphs. However, most prior works are directly adapted from the models designed for images. Unlike the data augmentation on images, the data augmentation on graphs is far less intuitive and much harder to provide high-quality contrastive samples, which are the key to the performance of contrastive learning models. This leaves much space for improvement over the existing graph contrastive learning frameworks. In this work, by introducing an adversarial graph view and an information regularizer, we propose a simple but effective method, Adversarial Graph Contrastive Learning (ARIEL), to extract informative contrastive samples within a reasonable constraint. It consistently outperforms the current graph contrastive learning methods in the node classification task over various real-world datasets and further improves the robustness of graph contrastive learning.
    Joint Probability Estimation Using Tensor Decomposition and Dictionaries. (arXiv:2203.01667v1 [cs.LG])
    In this work, we study non-parametric estimation of joint probabilities of a given set of discrete and continuous random variables from their (empirically estimated) 2D marginals, under the assumption that the joint probability could be decomposed and approximated by a mixture of product densities/mass functions. The problem of estimating the joint probability density function (PDF) using semi-parametric techniques such as Gaussian Mixture Models (GMMs) is widely studied. However such techniques yield poor results when the underlying densities are mixtures of various other families of distributions such as Laplacian or generalized Gaussian, uniform, Cauchy, etc. Further, GMMs are not the best choice to estimate joint distributions which are hybrid in nature, i.e., some random variables are discrete while others are continuous. We present a novel approach for estimating the PDF using ideas from dictionary representations in signal processing coupled with low rank tensor decompositions. To the best our knowledge, this is the first work on estimating joint PDFs employing dictionaries alongside tensor decompositions. We create a dictionary of various families of distributions by inspecting the data, and use it to approximate each decomposed factor of the product in the mixture. Our approach can naturally handle hybrid $N$-dimensional distributions. We test our approach on a variety of synthetic and real datasets to demonstrate its effectiveness in terms of better classification rates and lower error rates, when compared to state of the art estimators.
    Capturing Shape Information with Multi-Scale Topological Loss Terms for 3D Reconstruction. (arXiv:2203.01703v1 [cs.CV])
    Reconstructing 3D objects from 2D images is both challenging for our brains and machine learning algorithms. To support this spatial reasoning task, contextual information about the overall shape of an object is critical. However, such information is not captured by established loss terms (e.g. Dice loss). We propose to complement geometrical shape information by including multi-scale topological features, such as connected components, cycles, and voids, in the reconstruction loss. Our method calculates topological features from 3D volumetric data based on cubical complexes and uses an optimal transport distance to guide the reconstruction process. This topology-aware loss is fully differentiable, computationally efficient, and can be added to any neural network. We demonstrate the utility of our loss by incorporating it into SHAPR, a model for predicting the 3D cell shape of individual cells based on 2D microscopy images. Using a hybrid loss that leverages both geometrical and topological information of single objects to assess their shape, we find that topological information substantially improves the quality of reconstructions, thus highlighting its ability to extract more relevant features from image datasets.
    The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration. (arXiv:2111.15430v2 [cs.CV] UPDATED)
    In spite of the dominant performances of deep neural networks, recent works have shown that they are poorly calibrated, resulting in over-confident predictions. Miscalibration can be exacerbated by overfitting due to the minimization of the cross-entropy during training, as it promotes the predicted softmax probabilities to match the one-hot label assignments. This yields a pre-softmax activation of the correct class that is significantly larger than the remaining activations. Recent evidence from the literature suggests that loss functions that embed implicit or explicit maximization of the entropy of predictions yield state-of-the-art calibration performances. We provide a unifying constrained-optimization perspective of current state-of-the-art calibration losses. Specifically, these losses could be viewed as approximations of a linear penalty (or a Lagrangian) imposing equality constraints on logit distances. This points to an important limitation of such underlying equality constraints, whose ensuing gradients constantly push towards a non-informative solution, which might prevent from reaching the best compromise between the discriminative performance and calibration of the model during gradient-based optimization. Following our observations, we propose a simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances. Comprehensive experiments on a variety of image classification, semantic segmentation and NLP benchmarks demonstrate that our method sets novel state-of-the-art results on these tasks in terms of network calibration, without affecting the discriminative performance. The code is available at https://github.com/by-liu/MbLS .
    BEVT: BERT Pretraining of Video Transformers. (arXiv:2112.01529v3 [cs.CV] UPDATED)
    This paper studies the BERT pretraining of video transformers. It is a straightforward but worth-studying extension given the recent success from BERT pretraining of image transformers. We introduce BEVT which decouples video representation learning into spatial representation learning and temporal dynamics learning. In particular, BEVT first performs masked image modeling on image data, and then conducts masked image modeling jointly with masked video modeling on video data. This design is motivated by two observations: 1) transformers learned on image datasets provide decent spatial priors that can ease the learning of video transformers, which are often times computationally-intensive if trained from scratch; 2) discriminative clues, i.e., spatial and temporal information, needed to make correct predictions vary among different videos due to large intra-class and inter-class variations. We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results. On Kinetics 400, for which recognition mostly relies on discriminative spatial representations, BEVT achieves comparable results to strong supervised baselines. On Something-Something-V2 and Diving 48, which contain videos relying on temporal dynamics, BEVT outperforms by clear margins all alternative baselines and achieves state-of-the-art performance with a 71.4\% and 87.2\% Top-1 accuracy respectively. Code will be made available at \url{https://github.com/xyzforever/BEVT}.
    Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation. (arXiv:2201.01666v2 [cs.LG] UPDATED)
    In model-free deep reinforcement learning (RL) algorithms, using noisy value estimates to supervise policy evaluation and optimization is detrimental to the sample efficiency. As this noise is heteroscedastic, its effects can be mitigated using uncertainty-based weights in the optimization process. Previous methods rely on sampled ensembles, which do not capture all aspects of uncertainty. We provide a systematic analysis of the sources of uncertainty in the noisy supervision that occurs in RL, and introduce inverse-variance RL, a Bayesian framework which combines probabilistic ensembles and Batch Inverse Variance weighting. We propose a method whereby two complementary uncertainty estimation methods account for both the Q-value and the environment stochasticity to better mitigate the negative impacts of noisy supervision. Our results show significant improvement in terms of sample efficiency on discrete and continuous control tasks.
    Local Citation Recommendation with Hierarchical-Attention Text Encoder and SciBERT-based Reranking. (arXiv:2112.01206v2 [cs.IR] UPDATED)
    The goal of local citation recommendation is to recommend a missing reference from the local citation context and optionally also from the global context. To balance the tradeoff between speed and accuracy of citation recommendation in the context of a large-scale paper database, a viable approach is to first prefetch a limited number of relevant documents using efficient ranking methods and then to perform a fine-grained reranking using more sophisticated models. In that vein, BM25 has been found to be a tough-to-beat approach to prefetching, which is why recent work has focused mainly on the reranking step. Even so, we explore prefetching with nearest neighbor search among text embeddings constructed by a hierarchical attention network. When coupled with a SciBERT reranker fine-tuned on local citation recommendation tasks, our hierarchical Attention encoder (HAtten) achieves high prefetch recall for a given number of candidates to be reranked. Consequently, our reranker requires fewer prefetch candidates to rerank, yet still achieves state-of-the-art performance on various local citation recommendation datasets such as ACL-200, FullTextPeerRead, RefSeer, and arXiv.
    Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection. (arXiv:2111.09099v4 [cs.CV] UPDATED)
    Anomaly detection is commonly pursued as a one-class classification problem, where models can only learn from normal training samples, while being evaluated on both normal and abnormal test samples. Among the successful approaches for anomaly detection, a distinguished category of methods relies on predicting masked information (e.g. patches, future frames, etc.) and leveraging the reconstruction error with respect to the masked information as an abnormality score. Different from related methods, we propose to integrate the reconstruction-based functionality into a novel self-supervised predictive architectural building block. The proposed self-supervised block is generic and can easily be incorporated into various state-of-the-art anomaly detection methods. Our block starts with a convolutional layer with dilated filters, where the center area of the receptive field is masked. The resulting activation maps are passed through a channel attention module. Our block is equipped with a loss that minimizes the reconstruction error with respect to the masked area in the receptive field. We demonstrate the generality of our block by integrating it into several state-of-the-art frameworks for anomaly detection on image and video, providing empirical evidence that shows considerable performance improvements on MVTec AD, Avenue, and ShanghaiTech. We release our code as open source at https://github.com/ristea/sspcab.
    Deep Learning in Human Activity Recognition with Wearable Sensors: A Review on Advances. (arXiv:2111.00418v5 [cs.HC] UPDATED)
    Mobile and wearable devices have enabled numerous applications, including activity tracking, wellness monitoring, and human--computer interaction, that measure and improve our daily lives. Many of these applications are made possible by leveraging the rich collection of low-power sensors found in many mobile and wearable devices to perform human activity recognition (HAR). Recently, deep learning has greatly pushed the boundaries of HAR on mobile and wearable devices. This paper systematically categorizes and summarizes existing work that introduces deep learning methods for wearables-based HAR and provides a comprehensive analysis of the current advancements, developing trends, and major challenges. We also present cutting-edge frontiers and future directions for deep learning-based HAR.
    Terminal Embeddings in Sublinear Time. (arXiv:2110.08691v2 [cs.DS] UPDATED)
    Recently (Elkin, Filtser, Neiman 2017) introduced the concept of a {\it terminal embedding} from one metric space $(X,d_X)$ to another $(Y,d_Y)$ with a set of designated terminals $T\subset X$. Such an embedding $f$ is said to have distortion $\rho\ge 1$ if $\rho$ is the smallest value such that there exists a constant $C>0$ satisfying \begin{equation*} \forall x\in T\ \forall q\in X,\ C d_X(x, q) \le d_Y(f(x), f(q)) \le C \rho d_X(x, q) . \end{equation*} In the case that $X,Y$ are both Euclidean metrics with $Y$ being $m$-dimensional, recently (Narayanan, Nelson 2019), following work of (Mahabadi, Makarychev, Makarychev, Razenshteyn 2018), showed that distortion $1+\epsilon$ is achievable via such a terminal embedding with $m = O(\epsilon^{-2}\log n)$ for $n := |T|$. This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within $T$ and not to $T$ from the rest of space. The downside is that evaluating the embedding on some $q\in \mathbb{R}^d$ required solving a semidefinite program with $\Theta(n)$ constraints in $m$ variables and thus required some superlinear $\mathrm{poly}(n)$ runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process $T$ to obtain an almost linear-space data structure that supports computing the terminal embedding image of any $q\in\mathbb{R}^d$ in sublinear time $O^* (n^{1-\Theta(\epsilon^2)} + d)$. To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.
    Extracting Clinician's Goals by What-if Interpretable Modeling. (arXiv:2110.15165v2 [cs.LG] UPDATED)
    Although reinforcement learning (RL) has tremendous success in many fields, applying RL to real-world settings such as healthcare is challenging when the reward is hard to specify and no exploration is allowed. In this work, we focus on recovering clinicians' rewards in treating patients. We incorporate the what-if reasoning to explain the clinician's actions based on future outcomes. We use generalized additive models (GAMs) - a class of accurate, interpretable models - to recover the reward. In both simulation and a real-world hospital dataset, we show our model outperforms baselines. Finally, our model's explanations match several clinical guidelines when treating patients while we found the previously-used linear model often contradicts them.
    Generalization Guarantees for Neural Architecture Search with Train-Validation Split. (arXiv:2104.14132v3 [stat.ML] UPDATED)
    Neural Architecture Search (NAS) is a popular method for automatically designing optimized architectures for high-performance deep learning. In this approach, it is common to use bilevel optimization where one optimizes the model weights over the training data (inner problem) and various hyperparameters such as the configuration of the architecture over the validation data (outer problem). This paper explores the statistical aspects of such problems with train-validation splits. In practice, the inner problem is often overparameterized and can easily achieve zero loss. Thus, a-priori it seems impossible to distinguish the right hyperparameters based on training loss alone which motivates a better understanding of the role of train-validation split. To this aim this work establishes the following results. (1) We show that refined properties of the validation loss such as risk and hyper-gradients are indicative of those of the true test loss. This reveals that the outer problem helps select the most generalizable model and prevent overfitting with a near-minimal validation sample size. This is established for continuous search spaces which are relevant for differentiable schemes. Extensions to transfer learning are developed in terms of the mismatch between training & validation distributions. (2) We establish generalization bounds for NAS problems with an emphasis on an activation search problem. When optimized with gradient-descent, we show that the train-validation procedure returns the best (model, architecture) pair even if all architectures can perfectly fit the training data to achieve zero error. (3) Finally, we highlight connections between NAS, multiple kernel learning, and low-rank matrix learning. The latter leads to novel insights where the solution of the outer problem can be accurately learned via efficient spectral methods to achieve near-minimal risk.
    Patch Tracking-based Online Tensor Ring Completion for Streaming Visual Data. (arXiv:2105.14620v2 [cs.CV] UPDATED)
    Tensor completion is the problem of estimating the missing entries of a partially observed tensor by exploiting its low-rank structure. In streaming applications where the frames arrive sequentially such as video completion, the missing entries of the tensor need to be dynamically recovered in an online fashion. Traditional online completion algorithms treat the entire visual data as a tensor, which may not work satisfactorily when there is a big change in the tensor subspace along the temporal dimension, such as due to strong motion across the video frames. In this paper, we develop a novel patch tracking-based online tensor completion framework for streaming data. Each incoming tensor is extracted into small patches, and similar patches are tracked along the temporal domain. We propose a new patch tracking strategy that can accurately and efficiently track the patches with missing data. Further, a new online tensor ring completion method is proposed which can efficiently and accurately update the latent core tensors and complete the missing entries of the tracked patches. Extensive experimental results demonstrate the superior performance of the proposed algorithms compared with both online and offline state-of-the-art tensor completion methods.
    Label-Free Explainability for Unsupervised Models. (arXiv:2203.01928v1 [cs.LG])
    Unsupervised black-box models are challenging to interpret. Indeed, most existing explainability methods require labels to select which component(s) of the black-box's output to interpret. In the absence of labels, black-box outputs often are representation vectors whose components do not correspond to any meaningful quantity. Hence, choosing which component(s) to interpret in a label-free unsupervised/self-supervised setting is an important, yet unsolved problem. To bridge this gap in the literature, we introduce two crucial extensions of post-hoc explanation techniques: (1) label-free feature importance and (2) label-free example importance that respectively highlight influential features and training examples for a black-box to construct representations at inference time. We demonstrate that our extensions can be successfully implemented as simple wrappers around many existing feature and example importance methods. We illustrate the utility of our label-free explainability paradigm through a qualitative and quantitative comparison of representation spaces learned by various autoencoders trained on distinct unsupervised tasks.
    Multilingual Molecular Representation Learning via Contrastive Pre-training. (arXiv:2109.08830v2 [cs.LG] UPDATED)
    Molecular representation learning plays an essential role in cheminformatics. Recently, language model-based approaches have gained popularity as an alternative to traditional expert-designed features to encode molecules. However, these approaches only utilize a single molecular language for representation learning. Motivated by the fact that a given molecule can be described using different languages such as Simplified Molecular Line Entry System (SMILES), The International Union of Pure and Applied Chemistry (IUPAC), and The IUPAC International Chemical Identifier (InChI), we propose a multilingual molecular embedding generation approach called MM-Deacon (multilingual molecular domain embedding analysis via contrastive learning). MM-Deacon is pre-trained using SMILES and IUPAC as two different languages on large-scale molecules. We evaluated the robustness of our method on seven molecular property prediction tasks from MoleculeNet benchmark, zero-shot cross-lingual retrieval, and a drug-drug interaction prediction task.
    Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification. (arXiv:2111.11188v2 [cs.LG] UPDATED)
    Conservatism has led to significant progress in offline reinforcement learning (RL) where an agent learns from pre-collected datasets. However, as many real-world scenarios involve interaction among multiple agents, it is important to resolve offline RL in the multi-agent setting. Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect that offline RL algorithms will also transfer to multi-agent settings directly. Surprisingly, we empirically observe that conservative offline RL algorithms do not work well in the multi-agent setting -- the performance degrades significantly with an increasing number of agents. Towards mitigating the degradation, we identify a key issue that non-concavity of the value function makes the policy gradient improvements prone to local optima. Multiple agents exacerbate the problem severely, since the suboptimal policy by any agent can lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), which combines the first-order policy gradients and zeroth-order optimization methods to better optimize the conservative value functions over the actor parameters. Despite the simplicity, OMAR achieves state-of-the-art results in a variety of multi-agent control tasks.
    OCTOPUS: Overcoming Performance andPrivatization Bottlenecks in Distributed Learning. (arXiv:2105.00602v2 [cs.LG] UPDATED)
    The diversity and quantity of data warehouses, gathering data from distributed devices such as mobile devices, can enhance the success and robustness of machine learning algorithms. Federated learning enables distributed participants to collaboratively learn a commonly-shared model while holding data locally. However, it is also faced with expensive communication and limitations due to the heterogeneity of distributed data sources and lack of access to global data. In this paper, we investigate a practical distributed learning scenario where multiple downstream tasks (e.g., classifiers) could be efficiently learned from dynamically-updated and non-iid distributed data sources while providing local data privatization. We introduce a new distributed/collaborative learning scheme to address communication overhead via latent compression, leveraging global data while providing privatization of local data without additional cost due to encryption or perturbation. This scheme divides learning into (1) informative feature encoding, and transmitting the latent representation of local data to address communication overhead; (2) downstream tasks centralized at the server using the encoded codes gathered from each node to address computing overhead. Besides, a disentanglement strategy is applied to address the privatization of sensitive components of local data. Extensive experiments are conducted on image and speech datasets. The results demonstrate that downstream tasks on the compact latent representations with the privatization of local data can achieve comparable accuracy to centralized learning.
    Adaptive Informative Path Planning Using Deep Reinforcement Learning for UAV-based Active Sensing. (arXiv:2109.13570v2 [cs.RO] UPDATED)
    Aerial robots are increasingly being utilized for environmental monitoring and exploration. However, a key challenge is efficiently planning paths to maximize the information value of acquired data as an initially unknown environment is explored. To address this, we propose a new approach for informative path planning based on deep reinforcement learning (RL). Combining recent advances in RL and robotic applications, our method combines tree search with an offline-learned neural network predicting informative sensing actions. We introduce several components making our approach applicable for robotic tasks with high-dimensional state and large action spaces. By deploying the trained network during a mission, our method enables sample-efficient online replanning on platforms with limited computational resources. Simulations show that our approach performs on par with existing methods while reducing runtime by 8-10x. We validate its performance using real-world surface temperature data.
    A new harmonium for pattern recognition in survival data. (arXiv:2110.01960v2 [cs.LG] UPDATED)
    Survival analysis concerns the study of timeline data where the event of interest may remain unobserved (i.e., censored). Studies commonly record more than one type of event, but conventional survival techniques focus on a single event type. We set out to integrate both multiple independently censored time-to-event variables as well as missing observations. An energy-based approach is taken with a bi-partite structure between latent and visible states, commonly known as harmoniums (or restricted Boltzmann machines). The present harmonium is shown, both theoretically and experimentally, to capture non-linear patterns between distinct time recordings. We illustrate on real world data that, for a single time-to-event variable, our model is on par with established methods. In addition, we demonstrate that discriminative predictions improve by leveraging an extra time-to-event variable. In conclusion, multiple time-to-event variables can be successfully captured within the harmonium paradigm.
    Vector Quantized Diffusion Model for Text-to-Image Synthesis. (arXiv:2111.14822v3 [cs.CV] UPDATED)
    We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.
    Label-Aware Ranked Loss for robust People Counting using Automotive in-cabin Radar. (arXiv:2110.05876v2 [eess.SP] UPDATED)
    In this paper, we introduce the Label-Aware Ranked loss, a novel metric loss function. Compared to the state-of-the-art Deep Metric Learning losses, this function takes advantage of the ranked ordering of the labels in regression problems. To this end, we first show that the loss minimises when datapoints of different labels are ranked and laid at uniform angles between each other in the embedding space. Then, to measure its performance, we apply the proposed loss on a regression task of people counting with a short-range radar in a challenging scenario, namely a vehicle cabin. The introduced approach improves the accuracy as well as the neighboring labels accuracy up to 83.0% and 99.9%: An increase of 6.7%and 2.1% on state-of-the-art methods, respectively.
    Universally Rank Consistent Ordinal Regression in Neural Networks. (arXiv:2110.07470v2 [cs.LG] UPDATED)
    Despite the pervasiveness of ordinal labels in supervised learning, it remains common practice in deep learning to treat such problems as categorical classification using the categorical cross entropy loss. Recent methods attempting to address this issue while respecting the ordinal structure of the labels have resorted to converting ordinal regression into a series of extended binary classification subtasks. However, the adoption of such methods remains inconsistent due to theoretical and practical limitations. Here we address these limitations by demonstrating that the subtask probabilities form a Markov chain. We show how to straightforwardly modify neural network architectures to exploit this fact and thereby constrain predictions to be universally rank consistent. We furthermore prove that all rank consistent solutions can be represented within this formulation, and derive a loss function producing maximum likelihood parameter estimates. Using diverse benchmarks and the real-world application of a specialized recurrent neural network for COVID-19 prognosis, we demonstrate the practical superiority of this method versus the current state-of-the-art. The method is open sourced as user-friendly PyTorch and TensorFlow packages.
    Scenario generation for market risk models using generative neural networks. (arXiv:2109.10072v2 [cs.LG] UPDATED)
    In this research, we show how to expand existing approaches of using generative adversarial networks (GANs) as economic scenario generators (ESG) to a whole internal model - with enough risk factors to model the full band-width of investments for an insurance company and for a one year time horizon as required in Solvency 2. For validation of this approach as well as for optimization of GAN architecture, we provide a consistent, data-driven framework using existing evaluation measures based on nearest neighbor distances and a newly developed measure for the detection of the memorizing effect. Finally, we demonstrate that the results of a GAN-based ESG are similar to regulatory approved internal models in Europe. Therefore, GAN-based models can be seen as an assumption-free data-driven alternative way of market risk modeling.
    Mobile-Former: Bridging MobileNet and Transformer. (arXiv:2108.05895v3 [cs.CV] UPDATED)
    We present Mobile-Former, a parallel design of MobileNet and transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet at local processing and transformer at global interaction. And the bridge enables bidirectional fusion of local and global features. Different from recent works on vision transformer, the transformer in Mobile-Former contains very few tokens (e.g. 6 or fewer tokens) that are randomly initialized to learn global priors, resulting in low computational cost. Combining with the proposed light-weight cross attention to model the bridge, Mobile-Former is not only computationally efficient, but also has more representation power. It outperforms MobileNetV3 at low FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, Mobile-Former achieves 77.9\% top-1 accuracy at 294M FLOPs, gaining 1.3\% over MobileNetV3 but saving 17\% of computations. When transferring to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP in RetinaNet framework. Furthermore, we build an efficient end-to-end detector by replacing backbone, encoder and decoder in DETR with Mobile-Former, which outperforms DETR by 1.1 AP but saves 52\% of computational cost and 36\% of parameters.
    CogDL: A Toolkit for Deep Learning on Graphs. (arXiv:2103.00959v3 [cs.SI] UPDATED)
    Deep learning on graphs has attracted tremendous attention from the graph learning community in recent years. It has been widely adopted in various real-world applications from diverse domains, such as social and information networks, biological graphs, and molecular graphs. In this paper, we present CogDL--an extensive toolkit for deep learning on graphs--that allows researchers and developers to easily conduct experiments and build applications. In CogDL, we propose a unified design for the training loop of graph neural network (GNN) models, making it unique among existing graph learning libraries. By utilizing this unified trainer, we can optimize the GNN training loop with several training techniques such as distributed training and mixed precision training. Moreover, we develop efficient sparse operators for CogDL, enabling it to become the most competitive graph library for efficiency. Additionally, another important CogDL feature is its focus on ease of use with the goal of facilitating open, robust, and reproducible graph learning research. We leverage CogDL to report and maintain benchmark results on the fundamental graph tasks such as node classification and graph classification, which can be reproduced and directly used by the community. Finally, we demonstrate the effectiveness and efficiency of CogDL for real-world applications in AMiner--a large-scale academic mining and search system. The CogDL toolkit is available at: https://github.com/thudm/cogdl.
    Efficient time stepping for numerical integration using reinforcement learning. (arXiv:2104.03562v2 [math.DS] UPDATED)
    Many problems in science and engineering require an efficient numerical approximation of integrals or solutions to differential equations. For systems with rapidly changing dynamics, an equidistant discretization is often inadvisable as it either results in prohibitively large errors or computational effort. To this end, adaptive schemes, such as solvers based on Runge--Kutta pairs, have been developed which adapt the step size based on local error estimations at each step. While the classical schemes apply very generally and are highly efficient on regular systems, they can behave sub-optimal when an inefficient step rejection mechanism is triggered by structurally complex systems such as chaotic systems. To overcome these issues, we propose a method to tailor numerical schemes to the problem class at hand. This is achieved by combining simple, classical quadrature rules or ODE solvers with data-driven time-stepping controllers. Compared with learning solution operators to ODEs directly, it generalises better to unseen initial data as our approach employs classical numerical schemes as base methods. At the same time it can make use of identified structures of a problem class and, therefore, outperforms state-of-the-art adaptive schemes. Several examples demonstrate superior efficiency. Source code is available at https://github.com/lueckem/quadrature-ML.
    On-Policy Model Errors in Reinforcement Learning. (arXiv:2110.07985v2 [cs.LG] UPDATED)
    Model-free reinforcement learning algorithms can compute policy gradients given sampled environment transitions, but require large amounts of data. In contrast, model-based methods can use the learned model to generate new data, but model errors and bias can render learning unstable or suboptimal. In this paper, we present a novel method that combines real-world data and a learned model in order to get the best of both worlds. The core idea is to exploit the real-world data for on-policy predictions and use the learned model only to generalize to different actions. Specifically, we use the data as time-dependent on-policy correction terms on top of a learned model, to retain the ability to generate data without accumulating errors over long prediction horizons. We motivate this method theoretically and show that it counteracts an error term for model-based policy improvement. Experiments on MuJoCo- and PyBullet-benchmarks show that our method can drastically improve existing model-based approaches without introducing additional tuning parameters.
    Heterogeneous-Agent Trajectory Forecasting Incorporating Class Uncertainty. (arXiv:2104.12446v2 [cs.CV] UPDATED)
    Reasoning about the future behavior of other agents is critical to safe robot navigation. The multiplicity of plausible futures is further amplified by the uncertainty inherent to agent state estimation from data, including positions, velocities, and semantic class. Forecasting methods, however, typically neglect class uncertainty, conditioning instead only on the agent's most likely class, even though perception models often return full class distributions. To exploit this information, we present HAICU, a method for heterogeneous-agent trajectory forecasting that explicitly incorporates agents' class probabilities. We additionally present PUP, a new challenging real-world autonomous driving dataset, to investigate the impact of Perceptual Uncertainty in Prediction. It contains challenging crowded scenes with unfiltered agent class probabilities that reflect the long-tail of current state-of-the-art perception systems. We demonstrate that incorporating class probabilities in trajectory forecasting significantly improves performance in the face of uncertainty, and enables new forecasting capabilities such as counterfactual predictions.
    R2D2: Recursive Transformer based on Differentiable Tree for Interpretable Hierarchical Language Modeling. (arXiv:2107.00967v2 [cs.CL] UPDATED)
    Human language understanding operates at multiple levels of granularity (e.g., words, phrases, and sentences) with increasing levels of abstraction that can be hierarchically combined. However, existing deep models with stacked layers do not explicitly model any sort of hierarchical process. This paper proposes a recursive Transformer model based on differentiable CKY style binary trees to emulate the composition process. We extend the bidirectional language model pre-training objective to this architecture, attempting to predict each word given its left and right abstraction nodes. To scale up our approach, we also introduce an efficient pruned tree induction algorithm to enable encoding in just a linear number of composition steps. Experimental results on language modeling and unsupervised parsing show the effectiveness of our approach.
    Radon cumulative distribution transform subspace modeling for image classification. (arXiv:2004.03669v3 [cs.CV] UPDATED)
    We present a new supervised image classification method applicable to a broad class of image deformation models. The method makes use of the previously described Radon Cumulative Distribution Transform (R-CDT) for image data, whose mathematical properties are exploited to express the image data in a form that is more suitable for machine learning. While certain operations such as translation, scaling, and higher-order transformations are challenging to model in native image space, we show the R-CDT can capture some of these variations and thus render the associated image classification problems easier to solve. The method -- utilizing a nearest-subspace algorithm in R-CDT space -- is simple to implement, non-iterative, has no hyper-parameters to tune, is computationally efficient, label efficient, and provides competitive accuracies to state-of-the-art neural networks for many types of classification problems. In addition to the test accuracy performances, we show improvements (with respect to neural network-based methods) in terms of computational efficiency (it can be implemented without the use of GPUs), number of training samples needed for training, as well as out-of-distribution generalization. The Python code for reproducing our results is available at https://github.com/rohdelab/rcdt_ns_classifier.
    Extraction and Analysis of Highway On-Ramp Merging Scenarios from Naturalistic Trajectory Data. (arXiv:2104.05661v2 [cs.LG] UPDATED)
    Connected and Automated Vehicles (CAVs) are envisioned to transform the future industrial and private transportation sectors. However, due to the system's enormous complexity, functional verification and validation of safety aspects are essential before the technology merges into the public domain. Therefore, in recent years, a scenario-driven approach has gained acceptance, emphasizing the requirement of a solid data basis of scenarios. The large-scale research facility Test Bed Lower Saxony (TFNDS) enables the provision of ample information for a database of scenarios on highways. For that purpose, however, the scenarios of interest must be identified and extracted from the collected Naturalistic Trajectory Data (NTD). This work addresses this problem and proposes a methodology for onramp scenario extraction, enabling scenario categorization and assessment. An Hidden Markov Model (HMM) and Dynamic Time Warping (DTW) is utilized for extraction and a decision tree with the Surrogate Measure of Safety (SMoS) Post Enroachment Time (PET) for categorization and assessment. The efficacy of the approach is shown with a dataset of NTD collected on the TFNDS.
    NUQ: A Noise Metric for Diffusion MRI via Uncertainty Discrepancy Quantification. (arXiv:2203.01921v1 [eess.IV])
    Diffusion MRI (dMRI) is the only non-invasive technique sensitive to tissue micro-architecture, which can, in turn, be used to reconstruct tissue microstructure and white matter pathways. The accuracy of such tasks is hampered by the low signal-to-noise ratio in dMRI. Today, the noise is characterized mainly by visual inspection of residual maps and estimated standard deviation. However, it is hard to estimate the impact of noise on downstream tasks based only on such qualitative assessments. To address this issue, we introduce a novel metric, Noise Uncertainty Quantification (NUQ), for quantitative image quality analysis in the absence of a ground truth reference image. NUQ uses a recent Bayesian formulation of dMRI models to estimate the uncertainty of microstructural measures. Specifically, NUQ uses the maximum mean discrepancy metric to compute a pooled quality score by comparing samples drawn from the posterior distribution of the microstructure measures. We show that NUQ allows a fine-grained analysis of noise, capturing details that are visually imperceptible. We perform qualitative and quantitative comparisons on real datasets, showing that NUQ generates consistent scores across different denoisers and acquisitions. Lastly, by using NUQ on a cohort of schizophrenics and controls, we quantify the substantial impact of denoising on group differences.
    The Uncanny Similarity of Recurrence and Depth. (arXiv:2102.11011v4 [cs.LG] UPDATED)
    It is widely believed that deep neural networks contain layer specialization, wherein neural networks extract hierarchical features representing edges and patterns in shallow layers and complete objects in deeper layers. Unlike common feed-forward models that have distinct filters at each layer, recurrent networks reuse the same parameters at various depths. In this work, we observe that recurrent models exhibit the same hierarchical behaviors and the same performance benefits with depth as feed-forward networks despite reusing the same filters at every recurrence. By training models of various feed-forward and recurrent architectures on several datasets for image classification as well as maze solving, we show that recurrent networks have the ability to closely emulate the behavior of non-recurrent deep models, often doing so with far fewer parameters.
    Reinforcement learning for pursuit and evasion of microswimmers at low Reynolds number. (arXiv:2106.08609v3 [physics.flu-dyn] UPDATED)
    We consider a model of two competing microswimming agents engaged in a pursue-evasion task within a low-Reynolds-number environment. Agents can only perform simple maneuvers and sense hydrodynamic disturbances, which provide ambiguous (partial) information about the opponent's position and motion. We frame the problem as a zero-sum game: The pursuer has to capture the evader in the shortest time, while the evader aims at deferring capture as long as possible. We show that the agents, trained via adversarial reinforcement learning, are able to overcome partial observability by discovering increasingly complex sequences of moves and countermoves that outperform known heuristic strategies and exploit the hydrodynamic environment.
    Pay Attention to Relations: Multi-embeddings for Attributed Multiplex Networks. (arXiv:2203.01903v1 [cs.LG])
    Graph Convolutional Neural Networks (GCNs) have become effective machine learning algorithms for many downstream network mining tasks such as node classification, link prediction, and community detection. However, most GCN methods have been developed for homogenous networks and are limited to a single embedding for each node. Complex systems, often represented by heterogeneous, multiplex networks present a more difficult challenge for GCN models and require that such techniques capture the diverse contexts and assorted interactions that occur between nodes. In this work, we propose RAHMeN, a novel unified relation-aware embedding framework for attributed heterogeneous multiplex networks. Our model incorporates node attributes, motif-based features, relation-based GCN approaches, and relational self-attention to learn embeddings of nodes with respect to the various relations in a heterogeneous, multiplex network. In contrast to prior work, RAHMeN is a more expressive embedding framework that embraces the multi-faceted nature of nodes in such networks, producing a set of multi-embeddings that capture the varied and diverse contexts of nodes. We evaluate our model on four real-world datasets from Amazon, Twitter, YouTube, and Tissue PPIs in both transductive and inductive settings. Our results show that RAHMeN consistently outperforms comparable state-of-the-art network embedding models, and an analysis of RAHMeN's relational self-attention demonstrates that our model discovers interpretable connections between relations present in heterogeneous, multiplex networks.
    Robust Learning Meets Generative Models: Can Proxy Distributions Improve Adversarial Robustness?. (arXiv:2104.09425v3 [cs.LG] UPDATED)
    While additional training data improves the robustness of deep neural networks against adversarial examples, it presents the challenge of curating a large number of specific real-world samples. We circumvent this challenge by using additional data from proxy distributions learned by advanced generative models. We first seek to formally understand the transfer of robustness from classifiers trained on proxy distributions to the real data distribution. We prove that the difference between the robustness of a classifier on the two distributions is upper bounded by the conditional Wasserstein distance between them. Next we use proxy distributions to significantly improve the performance of adversarial training on five different datasets. For example, we improve robust accuracy by up to 7.5% and 6.7% in $\ell_{\infty}$ and $\ell_2$ threat model over baselines that are not using proxy distributions on the CIFAR-10 dataset. We also improve certified robust accuracy by 7.6% on the CIFAR-10 dataset. We further demonstrate that different generative models bring a disparate improvement in the performance in robust training. We propose a robust discrimination approach to characterize the impact of individual generative models and further provide a deeper understanding of why current state-of-the-art in diffusion-based generative models are a better choice for proxy distribution than generative adversarial networks.
    Few-shot Partial Multi-view Learning. (arXiv:2105.02046v2 [cs.CV] UPDATED)
    It is often the case that data are with multiple views in real-world applications. Fully exploring the information of each view is significant for making data more representative. However, real data tend to suffer from arbitrary view-missing due to various failures in data collection and pre-processing. Besides, as obtaining large-scale labeled data is laborious and expensive, the collected training samples in the target task may be scarce as well. The co-existence of these two problems makes it more challenging to achieve the pattern classification task. Currently, to our best knowledge, few appropriate methods can well-handle these two problems simultaneously. To draw more attention from the community to this challenge, we present a new task in this paper called Few-shot Partial Multi-view Learning (FPML), aiming to alleviate the view-missing effects in the low-data regime. The challenges of this task are twofold: (1) under the interference of the missing views, it is difficult to overcome the negative impact brought by data scarcity; (2) the limited number of data exacerbates information scarcity, thereby making it harder to address the view-missing problem. In this paper, we propose a novel method called Unified Gaussian Dense-anchoring (UGD) for this task. We propose to learn the unified dense Gaussian anchors for each training sample. Therefore, the incomplete multi-view data can be densely anchored into a unified representation space, where data scarcity and view missing are both relieved by the unified dense anchor representations. In particular, we also extend our FPML to the multimodal and the cross-domain scenario, and validate our UGD method on them. The extensive experiments firmly demonstrate the effectiveness of our approach.
    Benign Overfitting in Multiclass Classification: All Roads Lead to Interpolation. (arXiv:2106.10865v2 [stat.ML] UPDATED)
    The literature on "benign overfitting" in overparameterized models has been mostly restricted to regression or binary classification; however, modern machine learning operates in the multiclass setting. Motivated by this discrepancy, we study benign overfitting in multiclass linear classification. Specifically, we consider the following training algorithms on separable data: (i) empirical risk minimization (ERM) with cross-entropy loss, which converges to the multiclass support vector machine (SVM) solution; (ii) ERM with least-squares loss, which converges to the min-norm interpolating (MNI) solution; and, (iii) the one-vs-all SVM classifier. First, we provide a simple sufficient deterministic condition under which all three algorithms lead to classifiers that interpolate the training data and have equal accuracy. When the data is generated from Gaussian mixtures or a multinomial logistic model, this condition holds under high enough effective overparameterization. We also show that this sufficient condition is satisfied under "neural collapse", a phenomenon that is observed in training deep neural networks. Second, we derive novel bounds on the accuracy of the MNI classifier, thereby showing that all three training algorithms lead to benign overfitting under sufficient overparameterization. Ultimately, our analysis shows that good generalization is possible for SVM solutions beyond the realm in which typical margin-based bounds apply.
    What Makes Transfer Learning Work For Medical Images: Feature Reuse & Other Factors. (arXiv:2203.01825v1 [cs.LG])
    Transfer learning is a standard technique to transfer knowledge from one domain to another. For applications in medical imaging, transfer from ImageNet has become the de-facto approach, despite differences in the tasks and image characteristics between the domains. However, it is unclear what factors determine whether - and to what extent - transfer learning to the medical domain is useful. The long-standing assumption that features from the source domain get reused has recently been called into question. Through a series of experiments on several medical image benchmark datasets, we explore the relationship between transfer learning, data size, the capacity and inductive bias of the model, as well as the distance between the source and target domain. Our findings suggest that transfer learning is beneficial in most cases, and we characterize the important role feature reuse plays in its success.
    On the Robustness of Counterfactual Explanations to Adverse Perturbations. (arXiv:2201.09051v2 [cs.LG] UPDATED)
    Counterfactual explanations (CEs) are a powerful means for understanding how decisions made by algorithms can be changed. Researchers have proposed a number of desiderata that CEs should meet to be practically useful, such as requiring minimal effort to enact, or complying with causal models. We consider a further aspect to improve the usability of CEs: robustness to adverse perturbations, which may naturally happen due to unfortunate circumstances. Since CEs typically prescribe a sparse form of intervention (i.e., only a subset of the features should be changed), we provide two definitions of robustness, which concern, respectively, the features to change and to keep as they are. These definitions are workable in that they can be incorporated as penalty terms in the loss functions that are used for discovering CEs. To experiment with the proposed definitions of robustness, we create and release code where five data sets (commonly used in the field of fair and explainable machine learning) have been enriched with feature-specific annotations that can be used to sample meaningful perturbations. Our experiments show that CEs are often not robust and, if adverse perturbations take place, the intervention they prescribe may require a much larger cost than anticipated, or even become impossible. However, accounting for robustness in the search process, which can be done rather easily, allows discovering robust CEs systematically. Robust CEs are resilient to adverse perturbations: additional intervention to contrast perturbations is much less costly than for non-robust CEs. Our code is available at: https://github.com/marcovirgolin/robust-counterfactuals
    Automated clustering of COVID-19 anti-vaccine discourse on Twitter. (arXiv:2203.01549v1 [cs.SI])
    Attitudes about vaccination have become more polarized; it is common to see vaccine disinformation and fringe conspiracy theories online. An observational study of Twitter vaccine discourse is found in Ojea Quintana et al. (2021): the authors analyzed approximately six months' of Twitter discourse -- 1.3 million original tweets and 18 million retweets between December 2019 and June 2020, ranging from before to after the establishment of Covid-19 as a pandemic. This work expands upon Ojea Quintana et al. (2021) with two main contributions from data science. First, based on the authors' initial network clustering and qualitative analysis techniques, we are able to clearly demarcate and visualize the language patterns used in discourse by Antivaxxers (anti-vaccination campaigners and vaccine deniers) versus other clusters (collectively, Others). Second, using the characteristics of Antivaxxers' tweets, we develop text classifiers to determine the likelihood a given user is employing anti-vaccination language, ultimately contributing to an early-warning mechanism to improve the health of our epistemic environment and bolster (and not hinder) public health initiatives.
    Curvature Graph Generative Adversarial Networks. (arXiv:2203.01604v1 [cs.LG])
    Generative adversarial network (GAN) is widely used for generalized and robust learning on graph data. However, for non-Euclidean graph data, the existing GAN-based graph representation methods generate negative samples by random walk or traverse in discrete space, leading to the information loss of topological properties (e.g. hierarchy and circularity). Moreover, due to the topological heterogeneity (i.e., different densities across the graph structure) of graph data, they suffer from serious topological distortion problems. In this paper, we proposed a novel Curvature Graph Generative Adversarial Networks method, named \textbf{\modelname}, which is the first GAN-based graph representation method in the Riemannian geometric manifold. To better preserve the topological properties, we approximate the discrete structure as a continuous Riemannian geometric manifold and generate negative samples efficiently from the wrapped normal distribution. To deal with the topological heterogeneity, we leverage the Ricci curvature for local structures with different topological properties, obtaining to low-distortion representations. Extensive experiments show that CurvGAN consistently and significantly outperforms the state-of-the-art methods across multiple tasks and shows superior robustness and generalization.
    Federated Learning for Internet of Things: Applications, Challenges, and Opportunities. (arXiv:2111.07494v3 [cs.LG] UPDATED)
    Billions of IoT devices will be deployed in the near future, taking advantage of faster Internet speed and the possibility of orders of magnitude more endpoints brought by 5G/6G. With the growth of IoT devices, vast quantities of data that may contain users' private information will be generated. The high communication and storage costs, mixed with privacy concerns, will increasingly challenge the traditional ecosystem of centralized over-the-cloud learning and processing for IoT platforms. Federated Learning (FL) has emerged as the most promising alternative approach to this problem. In FL, training data-driven machine learning models is an act of collaboration between multiple clients without requiring the data to be brought to a central point, hence alleviating communication and storage costs and providing a great degree of user-level privacy. However, there are still some challenges existing in the real FL system implementation on IoT networks. In this paper, we will discuss the opportunities and challenges of FL in IoT platforms, as well as how it can enable diverse IoT applications. In particular, we identify and discuss seven critical challenges of FL in IoT platforms and highlight some recent promising approaches towards addressing them.
    The Role of Pretrained Representations for the OOD Generalization of RL Agents. (arXiv:2107.05686v3 [cs.LG] UPDATED)
    Building sample-efficient agents that generalize out-of-distribution (OOD) in real-world settings remains a fundamental unsolved problem on the path towards achieving higher-level cognition. One particularly promising approach is to begin with low-dimensional, pretrained representations of our world, which should facilitate efficient downstream learning and generalization. By training 240 representations and over 10,000 reinforcement learning (RL) policies on a simulated robotic setup, we evaluate to what extent different properties of pretrained VAE-based representations affect the OOD generalization of downstream agents. We observe that many agents are surprisingly robust to realistic distribution shifts, including the challenging sim-to-real case. In addition, we find that the generalization performance of a simple downstream proxy task reliably predicts the generalization performance of our RL agents under a wide range of OOD settings. Such proxy tasks can thus be used to select pretrained representations that will lead to agents that generalize.
    A Comparison of Latent Semantic Analysis and Correspondence Analysis of Document-Term Matrices. (arXiv:2108.06197v2 [cs.IR] UPDATED)
    Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition (SVD) for dimensionality reduction. LSA has been extensively used to obtain low-dimensional and dense vectors that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties as compared to LSA, for instance that effects of margins arising from differing document-lengths and term-frequencies are effectively eliminated, so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA based methods on two tasks, a document classification task in English and an authorship attribution task on historical Dutch texts, and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, amongst several contenders.
    On Improving Adversarial Transferability of Vision Transformers. (arXiv:2106.04169v3 [cs.CV] UPDATED)
    Vision transformers (ViTs) process input images as sequences of patches via self-attention; a radically different architecture than convolutional neural networks (CNNs). This makes it interesting to study the adversarial feature space of ViT models and their transferability. In particular, we observe that adversarial patterns found via conventional adversarial attacks show very \emph{low} black-box transferability even for large ViT models. We show that this phenomenon is only due to the sub-optimal attack procedures that do not leverage the true representation potential of ViTs. A deep ViT is composed of multiple blocks, with a consistent architecture comprising of self-attention and feed-forward layers, where each block is capable of independently producing a class token. Formulating an attack using only the last class token (conventional approach) does not directly leverage the discriminative information stored in the earlier tokens, leading to poor adversarial transferability of ViTs. Using the compositional nature of ViT models, we enhance transferability of existing attacks by introducing two novel strategies specific to the architecture of ViT models. (i) Self-Ensemble: We propose a method to find multiple discriminative pathways by dissecting a single ViT model into an ensemble of networks. This allows explicitly utilizing class-specific information at each ViT block. (ii) Token Refinement: We then propose to refine the tokens to further enhance the discriminative capacity at each block of ViT. Our token refinement systematically combines the class tokens with structural information preserved within the patch tokens.
    On partitioning of an SHM problem and parallels with transfer learning. (arXiv:2203.01655v1 [cs.LG])
    In the current work, a problem-splitting approach and a scheme motivated by transfer learning is applied to a structural health monitoring problem. The specific problem in this case is that of localising damage on an aircraft wing. The original experiment is described, together with the initial approach, in which a neural network was trained to localise damage. The results were not ideal, partly because of a scarcity of training data, and partly because of the difficulty in resolving two of the damage cases. In the current paper, the problem is split into two sub-problems and an increase in classification accuracy is obtained. The sub-problems are obtained by separating out the most difficult-to-classify damage cases. A second approach to the problem is considered by adopting ideas from transfer learning (usually applied in much deeper) networks to see if a network trained on the simpler damage cases can help with feature extraction in the more difficult cases. The transfer of a fixed trained batch of layers between the networks is found to improve classification by making the classes more separable in the feature space and to speed up convergence.
    Exploiting locality in high-dimensional factorial hidden Markov models. (arXiv:1902.01639v3 [stat.ML] UPDATED)
    We propose algorithms for approximate filtering and smoothing in high-dimensional Factorial hidden Markov models. The approximation involves discarding, in a principled way, likelihood factors according to a notion of locality in a factor graph associated with the emission distribution. This allows the exponential-in-dimension cost of exact filtering and smoothing to be avoided. We prove that the approximation accuracy, measured in a local total variation norm, is "dimension-free" in the sense that as the overall dimension of the model increases the error bounds we derive do not necessarily degrade. A key step in the analysis is to quantify the error introduced by localizing the likelihood function in a Bayes' rule update. The factorial structure of the likelihood function which we exploit arises naturally when data have known spatial or network structure. We demonstrate the new algorithms on synthetic examples and a London Underground passenger flow problem, where the factor graph is effectively given by the train network.
    Ease.ML/Snoopy: Towards Automatic Feasibility Studies for ML via Quantitative Understanding of "Data Quality for ML". (arXiv:2010.08410v3 [cs.LG] UPDATED)
    In our experience of working with domain experts who are using today's AutoML systems, a common problem we encountered is what we call "unrealistic expectations" -- when users are facing a very challenging task with a noisy data acquisition process, while being expected to achieve startlingly high accuracy with machine learning (ML). Many of these are predestined to fail from the beginning. In traditional software engineering, this problem is addressed via a feasibility study, an indispensable step before developing any software system. In this paper, we present Snoopy, with the goal of supporting data scientists and machine learning engineers performing a systematic and theoretically founded feasibility study before building ML applications. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER), which stems from data quality issues in datasets used to train or evaluate ML model artifacts. We design a practical Bayes error estimator that is compared against baseline feasibility study candidates on 6 datasets (with additional real and synthetic noise of different levels) in computer vision and natural language processing. Furthermore, by including our systematic feasibility study with additional signals into the iterative label cleaning process, we demonstrate in end-to-end experiments how users are able to save substantial labeling time and monetary efforts.
    Understanding microbiome dynamics via interpretable graph representation learning. (arXiv:2203.01830v1 [q-bio.QM])
    Large-scale perturbations in the microbiome constitution are strongly correlated, whether as a driver or a consequence, with the health and functioning of human physiology. However, understanding the difference in the microbiome profiles of healthy and ill individuals can be complicated due to the large number of complex interactions among microbes. We propose to model these interactions as a time-evolving graph whose nodes are microbes and edges are interactions among them. Motivated by the need to analyse such complex interactions, we develop a method that learns a low-dimensional representation of the time-evolving graph and maintains the dynamics occurring in the high-dimensional space. Through our experiments, we show that we can extract graph features such as clusters of nodes or edges that have the highest impact on the model to learn the low-dimensional representation. This information can be crucial to identify microbes and interactions among them that are strongly correlated with clinical diseases. We conduct our experiments on both synthetic and real-world microbiome datasets.
    Anomaly Detection in Big Data. (arXiv:2203.01684v1 [cs.LG])
    Anomaly is defined as a state of the system that do not conform to the normal behavior. For example, the emission of neutrons in a nuclear reactor channel above the specified threshold is an anomaly. Big data refers to the data set that is \emph{high volume, streaming, heterogeneous, distributed} and often \emph{sparse}. Big data is not uncommon these days. For example, as per Internet live stats, the number of tweets posted per day has gone above 500 millions. Due to data explosion in data laden domains, traditional anomaly detection techniques developed for small data sets scale poorly on large-scale data sets. Therefore, we take an alternative approach to tackle anomaly detection in big data. Essentially, there are two ways to scale anomaly detection in big data. The first is based on the \emph{online} learning and the second is based on the \emph{distributed} learning. Our aim in the thesis is to tackle big data problems while detecting anomaly efficiently. To that end, we first take \emph{streaming} issue of the big data and propose Passive-Aggressive GMEAN (PAGMEAN) algorithms. Although, online learning algorithm can scale well over large number of data points and dimensions, they can not process data when it is distributed at multiple locations; which is quite common these days. Therefore, we propose anomaly detection algorithm which is inherently distributed using ADMM. Finally, we present a case study on anomaly detection in nuclear power plant data.
    Application of Ghost-DeblurGAN to Fiducial Marker Detection. (arXiv:2109.03379v3 [eess.IV] CROSS LISTED)
    Feature extraction or localization based on the fiducial marker could fail due to motion blur in real-world robotic applications. To solve this problem, a lightweight generative adversarial network, named Ghost-DeblurGAN, for real-time motion deblurring is developed in this paper. Furthermore, on account that there is no existing deblurring benchmark for such task, a new large-scale dataset, YorkTag, is proposed that provides pairs of sharp/blurred images containing fiducial markers. With the proposed model trained and tested on YorkTag, it is demonstrated that when applied along with fiducial marker systems to motion-blurred images, Ghost-DeblurGAN improves the marker detection significantly. The datasets and codes used in this paper are available at: https://github.com/York-SDCNLab/Ghost-DeblurGAN.
    Adaptive Client Sampling in Federated Learning via Online Learning with Bandit Feedback. (arXiv:2112.14332v2 [cs.LG] UPDATED)
    Due to the high cost of communication, federated learning (FL) systems need to sample a subset of clients that are involved in each round of training. As a result, client sampling plays an important role in FL systems as it affects the convergence rate of optimization algorithms used to train machine learning models. Despite its importance, there is limited work on how to sample clients effectively. In this paper, we cast client sampling as an online learning task with bandit feedback, which we solve with an online stochastic mirror descent (OSMD) algorithm designed to minimize the sampling variance. To handle the tuning parameters in OSMD that depend on the unknown problem parameters, we use the online ensemble method and doubling trick. We prove a dynamic regret bound relative to the theoretically optimal sampling sequence. The regret bound depends on the total variation of the comparator sequence, which naturally captures the intrinsic difficulty of the problem. To the best of our knowledge, these theoretical contributions are new and the proof technique is of independent interest. Through both synthetic and real data experiments, we illustrate advantages of the proposed client sampling algorithm over the widely used uniform sampling and existing online learning based sampling strategies. The proposed adaptive sampling procedure is applicable beyond the FL problem studied here and can be used to improve the performance of stochastic optimization procedures such as stochastic gradient descent and stochastic coordinate descent.
    Detecting High-Quality GAN-Generated Face Images using Neural Networks. (arXiv:2203.01716v1 [cs.CR])
    In the past decades, the excessive use of the last-generation GAN (Generative Adversarial Networks) models in computer vision has enabled the creation of artificial face images that are visually indistinguishable from genuine ones. These images are particularly used in adversarial settings to create fake social media accounts and other fake online profiles. Such malicious activities can negatively impact the trustworthiness of users identities. On the other hand, the recent development of GAN models may create high-quality face images without evidence of spatial artifacts. Therefore, reassembling uniform color channel correlations is a challenging research problem. To face these challenges, we need to develop efficient tools able to differentiate between fake and authentic face images. In this chapter, we propose a new strategy to differentiate GAN-generated images from authentic images by leveraging spectral band discrepancies, focusing on artificial face image synthesis. In particular, we enable the digital preservation of face images using the Cross-band co-occurrence matrix and spatial co-occurrence matrix. Then, we implement these techniques and feed them to a Convolutional Neural Networks (CNN) architecture to identify the real from artificial faces. Additionally, we show that the performance boost is particularly significant and achieves more than 92% in different post-processing environments. Finally, we provide several research observations demonstrating that this strategy improves a comparable detection method based only on intra-band spatial co-occurrences.
    Providing reliability in Recommender Systems through Bernoulli Matrix Factorization. (arXiv:2006.03481v5 [cs.LG] UPDATED)
    Beyond accuracy, quality measures are gaining importance in modern recommender systems, with reliability being one of the most important indicators in the context of collaborative filtering. This paper proposes Bernoulli Matrix Factorization (BeMF), which is a matrix factorization model, to provide both prediction values and reliability values. BeMF is a very innovative approach from several perspectives: a) it acts on model-based collaborative filtering rather than on memory-based filtering, b) it does not use external methods or extended architectures, such as existing solutions, to provide reliability, c) it is based on a classification-based model instead of traditional regression-based models, and d) matrix factorization formalism is supported by the Bernoulli distribution to exploit the binary nature of the designed classification model. The experimental results show that the more reliable a prediction is, the less liable it is to be wrong: recommendation quality improves after the most reliable predictions are selected. State-of-the-art quality measures for reliability have been tested, which shows that BeMF outperforms previous baseline methods and models.
    Optimal Adaptive Matrix Completion. (arXiv:2002.02431v2 [cs.LG] UPDATED)
    We study the problem of exact completion for $m \times n$ sized matrix of rank r with the adaptive sampling method. We introduce a relation of the exact completion problem with the sparsest vector of column and row spaces (which we call sparsity-number here). Using this relation, we propose matrix completion algorithms that exactly recovers the target matrix. These algorithms are superior to previous works in two important ways. First, our algorithms exactly recover $\mu_0$-coherent column space matrices by probability at least $1-\epsilon$ using much smaller observations complexity than - $\mathcal{O}(\mu_0 rn \mathrm{log}\frac{r}{\epsilon})$ - the state of art. Specifically, many of the previous adaptive sampling methods require to observe the entire matrix when the column space is highly coherent. However, we show that our method is still able to recover this type of matrices by observing a small fraction of entries under many scenarios. Second, we propose an exact completion algorithm, which requires minimal pre-information as either row or column space is not being highly coherent. We provide an extension of these algorithms that is robust to sparse random noise. Besides, we propose an additional low-rank estimation algorithm that is robust to any small noise by adaptively studying the shape of column space. At the end of the paper, we provide experimental results that illustrate the strength of the algorithms proposed here.
    Quantum Reinforcement Learning via Policy Iteration. (arXiv:2203.01889v1 [quant-ph])
    Quantum computing has shown the potential to substantially speed up machine learning applications, in particular for supervised and unsupervised learning. Reinforcement learning, on the other hand, has become essential for solving many decision making problems and policy iteration methods remain the foundation of such approaches. In this paper, we provide a general framework for performing quantum reinforcement learning via policy iteration. We validate our framework by designing and analyzing: \emph{quantum policy evaluation} methods for infinite horizon discounted problems by building quantum states that approximately encode the value function of a policy $\pi$; and \emph{quantum policy improvement} methods by post-processing measurement outcomes on these quantum states. Last, we study the theoretical and experimental performance of our quantum algorithms on two environments from OpenAI's Gym.
    Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning. (arXiv:2001.00119v2 [cs.LG] UPDATED)
    Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on getting feedback via extrinsic rewards to train the agent, and in situations where this occurs very rarely the agent learns slowly or cannot learn at all. Similarly, if the agent receives also rewards that create suboptimal modes of the objective function, it will likely prematurely stop exploring. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment. Source code is available at https://github.com/sparisi/visit-value-explore
    Min-Max Bilevel Multi-objective Optimization with Applications in Machine Learning. (arXiv:2203.01924v1 [cs.LG])
    This paper is the first to propose a generic min-max bilevel multi-objective optimization framework, highlighting applications in representation learning and hyperparameter optimization. In many machine learning applications such as meta-learning, multi-task learning, and representation learning, a subset of the parameters are shared by all the tasks, while each specific task has its own set of additional parameters. By leveraging the recent advances of nonconvex min-max optimization, we propose a gradient descent-ascent bilevel optimization (MORBiT) algorithm which is able to extract a set of shared parameters that is robust over all tasks and further overcomes the distributional shift between training and testing tasks. Theoretical analyses show that MORBiT converges to the first-order stationary point at a rate of $\mathcal{O}(\sqrt{n}K^{-2/5})$ for a class of nonconvex problems, where $K$ denotes the total number of iterations and $n$ denotes the number of tasks. Overall, we formulate a min-max bilevel multi-objective optimization problem, provide a single loop two-timescale algorithm with convergence rate guarantees, and show theoretical bounds on the generalization abilities of the optimizer. Experimental results on sinusoid regression and representation learning showcase the superiority of MORBiT over state-of-the-art methods, validating our convergence and generalization results.
    An Adaptive Human Driver Model for Realistic Race Car Simulations. (arXiv:2203.01909v1 [cs.LG])
    Engineering a high-performance race car requires a direct consideration of the human driver using real-world tests or Human-Driver-in-the-Loop simulations. Apart from that, offline simulations with human-like race driver models could make this vehicle development process more effective and efficient but are hard to obtain due to various challenges. With this work, we intend to provide a better understanding of race driver behavior and introduce an adaptive human race driver model based on imitation learning. Using existing findings and an interview with a professional race engineer, we identify fundamental adaptation mechanisms and how drivers learn to optimize lap time on a new track. Subsequently, we use these insights to develop generalization and adaptation techniques for a recently presented probabilistic driver modeling approach and evaluate it using data from professional race drivers and a state-of-the-art race car simulator. We show that our framework can create realistic driving line distributions on unseen race tracks with almost human-like performance. Moreover, our driver model optimizes its driving lap by lap, correcting driving errors from previous laps while achieving faster lap times. This work contributes to a better understanding and modeling of the human driver, aiming to expedite simulation methods in the modern vehicle development process and potentially supporting automated driving and racing technologies.
    Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices. (arXiv:1903.08568v4 [cs.DS] UPDATED)
    We study the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution $\nu = e^{-f}$ on $\mathbb{R}^n$. We prove a convergence guarantee in Kullback-Leibler (KL) divergence assuming $\nu$ satisfies a log-Sobolev inequality and the Hessian of $f$ is bounded. Notably, we do not assume convexity or bounds on higher derivatives. We also prove convergence guarantees in R\'enyi divergence of order $q > 1$ assuming the limit of ULA satisfies either the log-Sobolev or Poincar\'e inequality. We also prove a bound on the bias of the limiting distribution of ULA assuming third-order smoothness of $f$, without requiring isoperimetry.
    Understanding Failure Modes of Self-Supervised Learning. (arXiv:2203.01881v1 [cs.LG])
    Self-supervised learning methods have shown impressive results in downstream classification tasks. However, there is limited work in understanding their failure models and interpreting the learned representations of these models. In this paper, we tackle these issues and study the representation space of self-supervised models by understanding the underlying reasons for misclassifications in a downstream task. Over several state-of-the-art self-supervised models including SimCLR, SwaV, MoCo V2 and BYOL, we observe that representations of correctly classified samples have few discriminative features with highly deviated values compared to other features. This is in a clear contrast with representations of misclassified samples. We also observe that noisy features in the representation space often correspond to spurious attributes in images making the models less interpretable. Building on these observations, we propose a sample-wise Self-Supervised Representation Quality Score (or, Q-Score) that, without access to any label information, is able to predict if a given sample is likely to be misclassified in the downstream task, achieving an AUPRC of up to 0.90. Q-Score can also be used as a regularization to remedy low-quality representations leading to 3.26% relative improvement in accuracy of SimCLR on ImageNet-100. Moreover, we show that Q-Score regularization increases representation sparsity, thus reducing noise and improving interpretability through gradient heatmaps.
    On Robustness of Neural Ordinary Differential Equations. (arXiv:1910.05513v4 [cs.LG] UPDATED)
    Neural ordinary differential equations (ODEs) have been attracting increasing attention in various research domains recently. There have been some works studying optimization issues and approximation capabilities of neural ODEs, but their robustness is still yet unclear. In this work, we fill this important gap by exploring robustness properties of neural ODEs both empirically and theoretically. We first present an empirical study on the robustness of the neural ODE-based networks (ODENets) by exposing them to inputs with various types of perturbations and subsequently investigating the changes of the corresponding outputs. In contrast to conventional convolutional neural networks (CNNs), we find that the ODENets are more robust against both random Gaussian perturbations and adversarial attack examples. We then provide an insightful understanding of this phenomenon by exploiting a certain desirable property of the flow of a continuous-time ODE, namely that integral curves are non-intersecting. Our work suggests that, due to their intrinsic robustness, it is promising to use neural ODEs as a basic block for building robust deep network models. To further enhance the robustness of vanilla neural ODEs, we propose the time-invariant steady neural ODE (TisODE), which regularizes the flow on perturbed data via the time-invariant property and the imposition of a steady-state constraint. We show that the TisODE method outperforms vanilla neural ODEs and also can work in conjunction with other state-of-the-art architectural methods to build more robust deep networks.
    Sparse Bayesian Optimization. (arXiv:2203.01900v1 [cs.LG])
    Bayesian optimization (BO) is a powerful approach to sample-efficient optimization of black-box objective functions. However, the application of BO to areas such as recommendation systems often requires taking the interpretability and simplicity of the configurations into consideration, a setting that has not been previously studied in the BO literature. To make BO applicable in this setting, we present several regularization-based approaches that allow us to discover sparse and more interpretable configurations. We propose a novel differentiable relaxation based on homotopy continuation that makes it possible to target sparsity by working directly with $L_0$ regularization. We identify failure modes for regularized BO and develop a hyperparameter-free method, sparsity exploring Bayesian optimization (SEBO) that seeks to simultaneously maximize a target objective and sparsity. SEBO and methods based on fixed regularization are evaluated on synthetic and real-world problems, and we show that we are able to efficiently optimize for sparsity.
    On generating parametrised structural data using conditional generative adversarial networks. (arXiv:2203.01641v1 [cs.LG])
    A powerful approach, and one of the most common ones in structural health monitoring (SHM), is to use data-driven models to make predictions and inferences about structures and their condition. Such methods almost exclusively rely on the quality of the data. Within the SHM discipline, data do not always suffice to build models with satisfactory accuracy for given tasks. Even worse, data may be completely missing from one's dataset, regarding the behaviour of a structure under different environmental conditions. In the current work, with a view to confronting such issues, the generation of artificial data using a variation of the generative adversarial network (GAN) algorithm, is used. The aforementioned variation is that of the conditional GAN or cGAN. The algorithm is not only used to generate artificial data, but also to learn transformations of manifolds according to some known parameters. Assuming that the structure's response is represented by points in a manifold, part of the space will be formed due to variations in external conditions affecting the structure. This idea proves efficient in SHM, as it is exploited to generate structural data for specific values of environmental coefficients. The scheme is applied here on a simulated structure which operates under different temperature and humidity conditions. The cGAN is trained on data for some discrete values of the temperature within some range, and is able to generate data for every temperature in this range with satisfactory accuracy. The novelty, compared to classic regression in similar problems, is that the cGAN allows unknown environmental parameters to affect the structure and can generate whole manifolds of data for every value of the known parameters, while the unknown ones vary within the generated manifolds.
    DenseUNets with feedback non-local attention for the segmentation of specular microscopy images of the corneal endothelium with Fuchs dystrophy. (arXiv:2203.01882v1 [eess.IV])
    To estimate the corneal endothelial parameters from specular microscopy images depicting cornea guttata (Fuchs endothelial dystrophy), we propose a new deep learning methodology that includes a novel attention mechanism named feedback non-local attention (fNLA). Our approach first infers the cell edges, then selects the cells that are well detected, and finally applies a postprocessing method to correct mistakes and provide the binary segmentation from which the corneal parameters are estimated (cell density [ECD], coefficient of variation [CV], and hexagonality [HEX]). In this study, we analyzed 1203 images acquired with a Topcon SP-1P microscope, 500 of which contained guttae. Manual segmentation was performed in all images. We compared the results of different networks (UNet, ResUNeXt, DenseUNets, UNet++) and found that DenseUNets with fNLA provided the best performance, with a mean absolute error of 23.16 [cells/mm$^{2}$] in ECD, 1.28 [%] in CV, and 3.13 [%] in HEX, which was 3-6 times smaller than the error obtained by Topcon's built-in software. Our approach handled the cells affected by guttae remarkably well, detecting cell edges occluded by small guttae while discarding areas covered by large guttae. fNLA made use of the local information, providing sharper edges in guttae areas and better results in the selection of well-detected cells. Overall, the proposed method obtained reliable and accurate estimations in extremely challenging specular images with guttae, being the first method in the literature to solve this problem adequately. Code is available in our GitHub.
    ROCT-Net: A new ensemble deep convolutional model with improved spatial resolution learning for detecting common diseases from retinal OCT images. (arXiv:2203.01883v1 [eess.IV])
    Optical coherence tomography (OCT) imaging is a well-known technology for visualizing retinal layers and helps ophthalmologists to detect possible diseases. Accurate and early diagnosis of common retinal diseases can prevent the patients from suffering critical damages to their vision. Computer-aided diagnosis (CAD) systems can significantly assist ophthalmologists in improving their examinations. This paper presents a new enhanced deep ensemble convolutional neural network for detecting retinal diseases from OCT images. Our model generates rich and multi-resolution features by employing the learning architectures of two robust convolutional models. Spatial resolution is a critical factor in medical images, especially the OCT images that contain tiny essential points. To empower our model, we apply a new post-architecture model to our ensemble model for enhancing spatial resolution learning without increasing computational costs. The introduced post-architecture model can be deployed to any feature extraction model to improve the utilization of the feature map's spatial values. We have collected two open-source datasets for our experiments to make our models capable of detecting six crucial retinal diseases: Age-related Macular Degeneration (AMD), Central Serous Retinopathy (CSR), Diabetic Retinopathy (DR), Choroidal Neovascularization (CNV), Diabetic Macular Edema (DME), and Drusen alongside the normal cases. Our experiments on two datasets and comparing our model with some other well-known deep convolutional neural networks have proven that our architecture can increase the classification accuracy up to 5%. We hope that our proposed methods create the next step of CAD systems development and help future researches. The code of this paper is shared at https://github.com/mr7495/OCT-classification.
    Local Constraint-Based Causal Discovery under Selection Bias. (arXiv:2203.01848v1 [cs.LG])
    We consider the problem of discovering causal relations from independence constraints selection bias in addition to confounding is present. While the seminal FCI algorithm is sound and complete in this setup, no criterion for the causal interpretation of its output under selection bias is presently known. We focus instead on local patterns of independence relations, where we find no sound method for only three variable that can include background knowledge. Y-Structure patterns are shown to be sound in predicting causal relations from data under selection bias, where cycles may be present. We introduce a finite-sample scoring rule for Y-Structures that is shown to successfully predict causal relations in simulation experiments that include selection mechanisms. On real-world microarray data, we show that a Y-Structure variant performs well across different datasets, potentially circumventing spurious correlations due to selection bias.
    Robustness and Adaptation to Hidden Factors of Variation. (arXiv:2203.01864v1 [cs.LG])
    We tackle here a specific, still not widely addressed aspect, of AI robustness, which consists of seeking invariance / insensitivity of model performance to hidden factors of variations in the data. Towards this end, we employ a two step strategy that a) does unsupervised discovery, via generative models, of sensitive factors that cause models to under-perform, and b) intervenes models to make their performance invariant to these sensitive factors' influence. We consider 3 separate interventions for robustness, including: data augmentation, semantic consistency, and adversarial alignment. We evaluate our method using metrics that measure trade offs between invariance (insensitivity) and overall performance (utility) and show the benefits of our method for 3 settings (unsupervised, semi-supervised and generalization).
    Identification in Tree-shaped Linear Structural Causal Models. (arXiv:2203.01852v1 [cs.AI])
    Linear structural equation models represent direct causal effects as directed edges and confounding factors as bidirected edges. An open problem is to identify the causal parameters from correlations between the nodes. We investigate models, whose directed component forms a tree, and show that there, besides classical instrumental variables, missing cycles of bidirected edges can be used to identify the model. They can yield systems of quadratic equations that we explicitly solve to obtain one or two solutions for the causal parameters of adjacent directed edges. We show how multiple missing cycles can be combined to obtain a unique solution. This results in an algorithm that can identify instances that previously required approaches based on Gr\"obner bases, which have doubly-exponential time complexity in the number of structural parameters.
    The world seems different in a social context: a neural network analysis of human experimental data. (arXiv:2203.01862v1 [q-bio.NC])
    Human perception and behavior are affected by the situational context, in particular during social interactions. A recent study demonstrated that humans perceive visual stimuli differently depending on whether they do the task by themselves or together with a robot. Specifically, it was found that the central tendency effect is stronger in social than in non-social task settings. The particular nature of such behavioral changes induced by social interaction, and their underlying cognitive processes in the human brain are, however, still not well understood. In this paper, we address this question by training an artificial neural network inspired by the predictive coding theory on the above behavioral data set. Using this computational model, we investigate whether the change in behavior that was caused by the situational context in the human experiment could be explained by continuous modifications of a parameter expressing how strongly sensory and prior information affect perception. We demonstrate that it is possible to replicate human behavioral data in both individual and social task settings by modifying the precision of prior and sensory signals, indicating that social and non-social task settings might in fact exist on a continuum. At the same time an analysis of the neural activation traces of the trained networks provides evidence that information is coded in fundamentally different ways in the network in the individual and in the social conditions. Our results emphasize the importance of computational replications of behavioral data for generating hypotheses on the underlying cognitive mechanisms of shared perception and may provide inspiration for follow-up studies in the field of neuroscience.
    A Brief Overview of Unsupervised Neural Speech Representation Learning. (arXiv:2203.01829v1 [eess.AS])
    Unsupervised representation learning for speech processing has matured greatly in the last few years. Work in computer vision and natural language processing has paved the way, but speech data offers unique challenges. As a result, methods from other domains rarely translate directly. We review the development of unsupervised representation learning for speech over the last decade. We identify two primary model categories: self-supervised methods and probabilistic latent variable models. We describe the models and develop a comprehensive taxonomy. Finally, we discuss and compare models from the two categories.
    Improving Non-native Word-level Pronunciation Scoring with Phone-level Mixup Data Augmentation and Multi-source Information. (arXiv:2203.01826v1 [eess.AS])
    Deep learning-based pronunciation scoring models highly rely on the availability of the annotated non-native data, which is costly and has scalability issues. To deal with the data scarcity problem, data augmentation is commonly used for model pretraining. In this paper, we propose a phone-level mixup, a simple yet effective data augmentation method, to improve the performance of word-level pronunciation scoring. Specifically, given a phoneme sequence from lexicon, the artificial augmented word sample can be generated by randomly sampling from the corresponding phone-level features in training data, while the word score is the average of their GOP scores. Benefit from the arbitrary phone-level combination, the mixup is able to generate any word with various pronunciation scores. Moreover, we utilize multi-source information (e.g., MFCC and deep features) to further improve the scoring system performance. The experiments conducted on the Speechocean762 show that the proposed system outperforms the baseline by adding the mixup data for pretraining, with Pearson correlation coefficients (PCC) increasing from 0.567 to 0.61. The results also indicate that proposed method achieves similar performance by using 1/10 unlabeled data of baseline. In addition, the experimental results also demonstrate the efficiency of our proposed multi-source approach.
    Query Processing on Tensor Computation Runtimes. (arXiv:2203.01877v1 [cs.DB])
    The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in new hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now part of the offerings of major cloud providers. Meanwhile, by hiding the low-level complexity through a tensor-based interface, tensor computation runtimes (TCRs) such as PyTorch allow data scientists to efficiently exploit the exciting capabilities offered by the new hardware. In this paper, we explore how databases can ride the wave of innovation happening in the AI space. Specifically, we present Tensor Query Processor (TQP): a SQL query processor leveraging the tensor interface of TCRs. TQP is able to efficiently run the full TPC-H benchmark by implementing novel algorithms for executing relational operators on the specialized tensor routines provided by TCRs. Meanwhile, TQP can target various hardware while only requiring a fraction of the usual development effort. Experiments show that TQP can improve query execution time by up to 20x over CPU-only systems, and up to 5x over specialized GPU solutions. Finally, TQP can accelerate queries mixing ML predictions and SQL end-to-end, and deliver up to 5x speedup over CPU baselines.
    NeuroFluid: Fluid Dynamics Grounding with Particle-Driven Neural Radiance Fields. (arXiv:2203.01762v1 [cs.LG])
    Deep learning has shown great potential for modeling the physical dynamics of complex particle systems such as fluids (in Lagrangian descriptions). Existing approaches, however, require the supervision of consecutive particle properties, including positions and velocities. In this paper, we consider a partially observable scenario known as fluid dynamics grounding, that is, inferring the state transitions and interactions within the fluid particle systems from sequential visual observations of the fluid surface. We propose a differentiable two-stage network named NeuroFluid. Our approach consists of (i) a particle-driven neural renderer, which involves fluid physical properties into the volume rendering function, and (ii) a particle transition model optimized to reduce the differences between the rendered and the observed images. NeuroFluid provides the first solution to unsupervised learning of particle-based fluid dynamics by training these two models jointly. It is shown to reasonably estimate the underlying physics of fluids with different initial shapes, viscosity, and densities. It is a potential alternative approach to understanding complex fluid mechanics, such as turbulence, that are difficult to model using traditional methods of mathematical physics.
    Semi-supervised Learning using Robust Loss. (arXiv:2203.01524v1 [cs.LG])
    The amount of manually labeled data is limited in medical applications, so semi-supervised learning and automatic labeling strategies can be an asset for training deep neural networks. However, the quality of the automatically generated labels can be uneven and inferior to manual labels. In this paper, we suggest a semi-supervised training strategy for leveraging both manually labeled data and extra unlabeled data. In contrast to the existing approaches, we apply robust loss for the automated labeled data to automatically compensate for the uneven data quality using a teacher-student framework. First, we generate pseudo-labels for unlabeled data using a teacher model pre-trained on labeled data. These pseudo-labels are noisy, and using them along with labeled data for training a deep neural network can severely degrade learned feature representations and the generalization of the network. Here we mitigate the effect of these pseudo-labels by using robust loss functions. Specifically, we use three robust loss functions, namely beta cross-entropy, symmetric cross-entropy, and generalized cross-entropy. We show that our proposed strategy improves the model performance by compensating for the uneven quality of labels in image classification as well as segmentation applications.
    Vertical Federated Principal Component Analysis and Its Kernel Extension on Feature-wise Distributed Data. (arXiv:2203.01752v1 [cs.LG])
    Despite enormous research interest and rapid application of federated learning (FL) to various areas, existing studies mostly focus on supervised federated learning under the horizontally partitioned local dataset setting. This paper will study the unsupervised FL under the vertically partitioned dataset setting. Accordingly, we propose the federated principal component analysis for vertically partitioned dataset (VFedPCA) method, which reduces the dimensionality across the joint datasets over all the clients and extracts the principal component feature information for downstream data analysis. We further take advantage of the nonlinear dimensionality reduction and propose the vertical federated advanced kernel principal component analysis (VFedAKPCA) method, which can effectively and collaboratively model the nonlinear nature existing in many real datasets. In addition, we study two communication topologies. The first is a server-client topology where a semi-trusted server coordinates the federated training, while the second is the fully-decentralized topology which further eliminates the requirement of the server by allowing clients themselves to communicate with their neighbors. Extensive experiments conducted on five types of real-world datasets corroborate the efficacy of VFedPCA and VFedAKPCA under the vertically partitioned FL setting. Code is available at: https://github.com/juyongjiang/VFedPCA-VFedAKPCA
    Topological data analysis of truncated contagion maps. (arXiv:2203.01720v1 [math.AT])
    The investigation of dynamical processes on networks has been one focus for the study of contagion processes. It has been demonstrated that contagions can be used to obtain information about the embedding of nodes in a Euclidean space. Specifically, one can use the activation times of threshold contagions to construct contagion maps as a manifold-learning approach. One drawback of contagion maps is their high computational cost. Here, we demonstrate that a truncation of the threshold contagions may considerably speed up the construction of contagion maps. Finally, we show that contagion maps may be used to find an insightful low-dimensional embedding for single-cell RNA-sequencing data in the form of cell-similarity networks and so reveal biological manifolds. Overall, our work makes the use of contagion maps as manifold-learning approaches on empirical network data more viable.
    Integrating Contrastive Learning with Dynamic Models for Reinforcement Learning from Images. (arXiv:2203.01810v1 [cs.LG])
    Recent methods for reinforcement learning from images use auxiliary tasks to learn image features that are used by the agent's policy or Q-function. In particular, methods based on contrastive learning that induce linearity of the latent dynamics or invariance to data augmentation have been shown to greatly improve the sample efficiency of the reinforcement learning algorithm and the generalizability of the learned embedding. We further argue, that explicitly improving Markovianity of the learned embedding is desirable and propose a self-supervised representation learning method which integrates contrastive learning with dynamic models to synergistically combine these three objectives: (1) We maximize the InfoNCE bound on the mutual information between the state- and action-embedding and the embedding of the next state to induce a linearly predictive embedding without explicitly learning a linear transition model, (2) we further improve Markovianity of the learned embedding by explicitly learning a non-linear transition model using regression, and (3) we maximize the mutual information between the two nonlinear predictions of the next embeddings based on the current action and two independent augmentations of the current state, which naturally induces transformation invariance not only for the state embedding, but also for the nonlinear transition model. Experimental evaluation on the Deepmind control suite shows that our proposed method achieves higher sample efficiency and better generalization than state-of-art methods based on contrastive learning or reconstruction.
    Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows. (arXiv:2203.01786v1 [cs.SD])
    Despite recent advances in generative modeling for text-to-speech synthesis, these models do not yet have the same fine-grained adjustability of pitch-conditioned deterministic models such as FastPitch and FastSpeech2. Pitch information is not only low-dimensional, but also discontinuous, making it particularly difficult to model in a generative setting. Our work explores several techniques for handling the aforementioned issues in the context of Normalizing Flow models. We also find this problem to be very well suited for Neural Spline flows, which is a highly expressive alternative to the more common affine-coupling mechanism in Normalizing Flows.
    Accelerated SGD for Non-Strongly-Convex Least Squares. (arXiv:2203.01744v1 [cs.LG])
    We consider stochastic approximation for the least squares regression problem in the non-strongly convex setting. We present the first practical algorithm that achieves the optimal prediction error rates in terms of dependence on the noise of the problem, as $O(d/t)$ while accelerating the forgetting of the initial conditions to $O(d/t^2)$. Our new algorithm is based on a simple modification of the accelerated gradient descent. We provide convergence results for both the averaged and the last iterate of the algorithm. In order to describe the tightness of these new bounds, we present a matching lower bound in the noiseless setting and thus show the optimality of our algorithm.
    ADPCM with nonlinear prediction. (arXiv:2203.01818v1 [eess.AS])
    Many speech coders are based on linear prediction coding (LPC), nevertheless with LPC is not possible to model the nonlinearities present in the speech signal. Because of this there is a growing interest for nonlinear techniques. In this paper we discuss ADPCM schemes with a nonlinear predictor based on neural nets, which yields an increase of 1-2.5dB in the SEGSNR over classical methods. This paper will discuss the block-adaptive and sample-adaptive predictions.
    Ensembles of Vision Transformers as a New Paradigm for Automated Classification in Ecology. (arXiv:2203.01726v1 [cs.CV])
    Monitoring biodiversity is paramount to manage and protect natural resources, particularly in times of global change. Collecting images of organisms over large temporal or spatial scales is a promising practice to monitor and study biodiversity change of natural ecosystems, providing large amounts of data with minimal interference with the environment. Deep learning models are currently used to automate classification of organisms into taxonomic units. However, imprecision in these classifiers introduce a measurement noise that is difficult to control and can significantly hinder the analysis and interpretation of data. In our study, we show that this limitation can be overcome by ensembles of Data-efficient image Transformers (DeiTs), which significantly outperform the previous state of the art (SOTA). We validate our results on a large number of ecological imaging datasets of diverse origin, and organisms of study ranging from plankton to insects, birds, dog breeds, animals in the wild, and corals. On all the data sets we test, we achieve a new SOTA, with a reduction of the error with respect to the previous SOTA ranging from 18.48% to 87.50%, depending on the data set, and often achieving performances very close to perfect classification. The main reason why ensembles of DeiTs perform better is not due to the single-model performance of DeiTs, but rather to the fact that predictions by independent models have a smaller overlap, and this maximizes the profit gained by ensembling. This positions DeiT ensembles as the best candidate for image classification in biodiversity monitoring.
    Speech segmentation using multilevel hybrid filters. (arXiv:2203.01819v1 [eess.AS])
    A novel approach for speech segmentation is proposed, based on Multilevel Hybrid (mean/min) Filters (MHF) with the following features: An accurate transition location. Good performance in noisy environments (gaussian and impulsive noise). The proposed method is based on spectral changes, with the goal of segmenting the voice into homogeneous acoustic segments. This algorithm is being used for phoneticallysegmented speech coder, with successful results.
    Detection of Word Adversarial Examples in Text Classification: Benchmark and Baseline via Robust Density Estimation. (arXiv:2203.01677v1 [cs.CL])
    Word-level adversarial attacks have shown success in NLP models, drastically decreasing the performance of transformer-based models in recent years. As a countermeasure, adversarial defense has been explored, but relatively few efforts have been made to detect adversarial examples. However, detecting adversarial examples may be crucial for automated tasks (e.g. review sentiment analysis) that wish to amass information about a certain population and additionally be a step towards a robust defense system. To this end, we release a dataset for four popular attack methods on four datasets and four models to encourage further research in this field. Along with it, we propose a competitive baseline based on density estimation that has the highest AUC on 29 out of 30 dataset-attack-model combinations. Source code is available in https://github.com/anoymous92874838/text-adv-detection.
    Fail-Safe Generative Adversarial Imitation Learning. (arXiv:2203.01696v1 [cs.LG])
    For flexible yet safe imitation learning (IL), we propose a modular approach that uses a generative imitator policy with a safety layer, has an overall explicit density/gradient, can therefore be end-to-end trained using generative adversarial IL (GAIL), and comes with theoretical worst-case safety/robustness guarantees. The safety layer's exact density comes from using a countable non-injective gluing of piecewise differentiable injections and the change-of-variables formula. The safe set (into which the safety layer maps) is inferred by sampling actions and their potential future fail-safe fallback continuations, together with Lipschitz continuity and convexity arguments. We also provide theoretical bounds showing the advantage of using the safety layer already during training (imitation error linear in the horizon) compared to only using it at test time (quadratic error). In an experiment on challenging real-world driver interaction data, we empirically demonstrate tractability, safety and imitation performance of our approach.
    Why Do Machine Learning Practitioners Still Use Manual Tuning? A Qualitative Study. (arXiv:2203.01717v1 [cs.LG])
    Current advanced hyperparameter optimization (HPO) methods, such as Bayesian optimization, have high sampling efficiency and facilitate replicability. Nonetheless, machine learning (ML) practitioners (e.g., engineers, scientists) mostly apply less advanced HPO methods, which can increase resource consumption during HPO or lead to underoptimized ML models. Therefore, we suspect that practitioners choose their HPO method to achieve different goals, such as decrease practitioner effort and target audience compliance. To develop HPO methods that align with such goals, the reasons why practitioners decide for specific HPO methods must be unveiled and thoroughly understood. Because qualitative research is most suitable to uncover such reasons and find potential explanations for them, we conducted semi-structured interviews to explain why practitioners choose different HPO methods. The interviews revealed six principal practitioner goals (e.g., increasing model comprehension), and eleven key factors that impact decisions for HPO methods (e.g., available computing resources). We deepen the understanding about why practitioners decide for different HPO methods and outline recommendations for improvements of HPO methods by aligning them with practitioner goals.
    On Practical Reinforcement Learning: Provable Robustness, Scalability, and Statistical Efficiency. (arXiv:2203.01758v1 [cs.LG])
    This thesis rigorously studies fundamental reinforcement learning (RL) methods in modern practical considerations, including robust RL, distributional RL, and offline RL with neural function approximation. The thesis first prepares the readers with an overall overview of RL and key technical background in statistics and optimization. In each of the settings, the thesis motivates the problems to be studied, reviews the current literature, provides computationally efficient algorithms with provable efficiency guarantees, and concludes with future research directions. The thesis makes fundamental contributions to the three settings above, both algorithmically, theoretically, and empirically, while staying relevant to practical considerations.
    An Effective Graph Learning based Approach for Temporal Link Prediction: The First Place of WSDM Cup 2022. (arXiv:2203.01820v1 [cs.SI])
    Temporal link prediction, as one of the most crucial work in temporal graphs, has attracted lots of attention from the research area. The WSDM Cup 2022 seeks for solutions that predict the existence probabilities of edges within time spans over temporal graph. This paper introduces the solution of AntGraph, which wins the 1st place in the competition. We first analysis the theoretical upper-bound of the performance by removing temporal information, which implies that only structure and attribute information on the graph could achieve great performance. Based on this hypothesis, then we introduce several well-designed features. Finally, experiments conducted on the competition datasets show the superiority of our proposal, which achieved AUC score of 0.666 on dataset A and 0.902 on dataset B, the ablation studies also prove the efficiency of each feature. Code is publicly available at https://github.com/im0qianqian/WSDM2022TGP-AntGraph.
    The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches. (arXiv:2203.01491v1 [cs.LG])
    In this paper, we study the problem of regret minimization for episodic Reinforcement Learning (RL) both in the model-free and the model-based setting. We focus on learning with general function classes and general model classes, and we derive results that scale with the eluder dimension of these classes. In contrast to the existing body of work that mainly establishes instance-independent regret guarantees, we focus on the instance-dependent setting and show that the regret scales logarithmically with the horizon $T$, provided that there is a gap between the best and the second best action in every state. In addition, we show that such a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost (also known as adaptivity complexity). In other words, these algorithms rarely switch their policy during the course of their execution. Finally, we complement our results with lower bounds which show that even in the tabular setting, we cannot hope for regret guarantees lower than $o(\log T)$.
    Modularity of the ABCD Random Graph Model with Community Structure. (arXiv:2203.01480v1 [cs.SI])
    The Artificial Benchmark for Community Detection (ABCD) graph is a random graph model with community structure and power-law distribution for both degrees and community sizes. The model generates graphs with similar properties as the well-known LFR one, and its main parameter $\xi$ can be tuned to mimic its counterpart in the LFR model, the mixing parameter $\mu$. In this paper, we investigate various theoretical asymptotic properties of the ABCD model. In particular, we analyze the modularity function, arguably, the most important graph property of networks in the context of community detection. Indeed, the modularity function is often used to measure the presence of community structure in networks. It is also used as a quality function in many community detection algorithms, including the widely used Louvain algorithm.
    Ensemble Methods for Robust Support Vector Machines using Integer Programming. (arXiv:2203.01606v1 [cs.LG])
    In this work we study binary classification problems where we assume that our training data is subject to uncertainty, i.e. the precise data points are not known. To tackle this issue in the field of robust machine learning the aim is to develop models which are robust against small perturbations in the training data. We study robust support vector machines (SVM) and extend the classical approach by an ensemble method which iteratively solves a non-robust SVM on different perturbations of the dataset, where the perturbations are derived by an adversarial problem. Afterwards for classification of an unknown data point we perform a majority vote of all calculated SVM solutions. We study three different variants for the adversarial problem, the exact problem, a relaxed variant and an efficient heuristic variant. While the exact and the relaxed variant can be modeled using integer programming formulations, the heuristic one can be implemented by an easy and efficient algorithm. All derived methods are tested on random and realistic datasets and the results indicate that the derived ensemble methods have a much more stable behaviour when changing the protection level compared to the classical robust SVM model.
    Graph Representation Learning Beyond Node and Homophily. (arXiv:2203.01564v1 [cs.LG])
    Unsupervised graph representation learning aims to distill various graph information into a downstream task-agnostic dense vector embedding. However, existing graph representation learning approaches are designed mainly under the node homophily assumption: connected nodes tend to have similar labels and optimize performance on node-centric downstream tasks. Their design is apparently against the task-agnostic principle and generally suffers poor performance in tasks, e.g., edge classification, that demands feature signals beyond the node-view and homophily assumption. To condense different feature signals into the embeddings, this paper proposes PairE, a novel unsupervised graph embedding method using two paired nodes as the basic unit of embedding to retain the high-frequency signals between nodes to support node-related and edge-related tasks. Accordingly, a multi-self-supervised autoencoder is designed to fulfill two pretext tasks: one retains the high-frequency signal better, and another enhances the representation of commonality. Our extensive experiments on a diversity of benchmark datasets clearly show that PairE outperforms the unsupervised state-of-the-art baselines, with up to 101.1\% relative improvement on the edge classification tasks that rely on both the high and low-frequency signals in the pair and up to 82.5\% relative performance gain on the node classification tasks.
    Reinforcement Learning in Possibly Nonstationary Environments. (arXiv:2203.01707v1 [stat.ML])
    We consider reinforcement learning (RL) methods in offline nonstationary environments. Many existing RL algorithms in the literature rely on the stationarity assumption that requires the system transition and the reward function to be constant over time. However, the stationarity assumption is restrictive in practice and is likely to be violated in a number of applications, including traffic signal control, robotics and mobile health. In this paper, we develop a consistent procedure to test the nonstationarity of the optimal policy based on pre-collected historical data, without additional online data collection. Based on the proposed test, we further develop a sequential change point detection method that can be naturally coupled with existing state-of-the-art RL methods for policy optimisation in nonstationary environments. The usefulness of our method is illustrated by theoretical results, simulation studies, and a real data example from the 2018 Intern Health Study. A Python implementation of the proposed procedure is available at https://github.com/limengbinggz/CUSUM-RL
    An Open Challenge for Inductive Link Prediction on Knowledge Graphs. (arXiv:2203.01520v1 [cs.LG])
    An emerging trend in representation learning over knowledge graphs (KGs) moves beyond transductive link prediction tasks over a fixed set of known entities in favor of inductive tasks that imply training on one graph and performing inference over a new graph with unseen entities. In inductive setups, node features are often not available and training shallow entity embedding matrices is meaningless as they cannot be used at inference time with unseen entities. Despite the growing interest, there are not enough benchmarks for evaluating inductive representation learning methods. In this work, we introduce ILPC 2022, a novel open challenge on KG inductive link prediction. To this end, we constructed two new datasets based on Wikidata with various sizes of training and inference graphs that are much larger than existing inductive benchmarks. We also provide two strong baselines leveraging recently proposed inductive methods. We hope this challenge helps to streamline community efforts in the inductive graph representation learning area. ILPC 2022 follows best practices on evaluation fairness and reproducibility, and is available at https://github.com/pykeen/ilpc2022.
    Continuous Relaxation For The Multivariate Non-Central Hypergeometric Distribution. (arXiv:2203.01629v1 [cs.LG])
    Partitioning a set of elements into a given number of groups of a priori unknown sizes is an important task in many applications. Due to hard constraints, it is a non-differentiable problem which prohibits its direct use in modern machine learning frameworks. Hence, previous works mostly fall back on suboptimal heuristics or simplified assumptions. The multivariate hypergeometric distribution offers a probabilistic formulation of how to distribute a given number of samples across multiple groups. Unfortunately, as a discrete probability distribution, it neither is differentiable. In this work, we propose a continuous relaxation for the multivariate non-central hypergeometric distribution. We introduce an efficient and numerically stable sampling procedure. This enables reparameterized gradients for the hypergeometric distribution and its integration into automatic differentiation frameworks. We highlight the applicability and usability of the proposed formulation on two different common machine learning tasks.
    Parallel feature selection based on the trace ratio criterion. (arXiv:2203.01635v1 [cs.LG])
    The growth of data today poses a challenge in management and inference. While feature extraction methods are capable of reducing the size of the data for inference, they do not help in minimizing the cost of data storage. On the other hand, feature selection helps to remove the redundant features and therefore is helpful not only in inference but also in reducing management costs. This work presents a novel parallel feature selection approach for classification, namely Parallel Feature Selection using Trace criterion (PFST), which scales up to very large datasets. Our method uses trace criterion, a measure of class separability used in Fisher's Discriminant Analysis, to evaluate feature usefulness. We analyzed the criterion's desirable properties theoretically. Based on the criterion, PFST rapidly finds important features out of a set of features for big datasets by first making a forward selection with early removal of seemingly redundant features parallelly. After the most important features are included in the model, we check back their contribution for possible interaction that may improve the fit. Lastly, we make a backward selection to check back possible redundant added by the forward steps. We evaluate our methods via various experiments using Linear Discriminant Analysis as the classifier on selected features. The experiments show that our method can produce a small set of features in a fraction of the amount of time by the other methods under comparison. In addition, the classifier trained on the features selected by PFST not only achieves better accuracy than the ones chosen by other approaches but can also achieve better accuracy than the classification on all available features.
    Quantity over Quality: Training an AV Motion Planner with Large Scale Commodity Vision Data. (arXiv:2203.01681v1 [cs.RO])
    With the Autonomous Vehicle (AV) industry shifting towards Autonomy 2.0, the performance of self-driving systems starts to rely heavily on large quantities of expert driving demonstrations. However, collecting this demonstration data typically involves expensive HD sensor suites (LiDAR + RADAR + cameras), which quickly becomes financially infeasible at the scales required. This motivates the use of commodity vision sensors for data collection, which are an order of magnitude cheaper than the HD sensor suites, but offer lower fidelity. If it were possible to leverage these for training an AV motion planner, observing the `long tail' of driving events would become a financially viable strategy. As our main contribution we show it is possible to train a high-performance motion planner using commodity vision data which outperforms planners trained on HD-sensor data for a fraction of the cost. We do this by comparing the autonomy system performance when training on these two different sensor configurations, and showing that we can compensate for the lower sensor fidelity by means of increased quantity: a planner trained on 100h of commodity vision data outperforms one with 25h of expensive HD data. We also share the technical challenges we had to tackle to make this work. To the best of our knowledge, we are the first to demonstrate that this is possible using real-world data.
    Physics-Data Driven Machine Learning Based Model: A Hybrid Way for Nonlinear, Dynamic, and Open-loop Identification of IPMC Soft Artificial Muscles. (arXiv:2203.01616v1 [cs.LG])
    Ionic Polymer Metal Composites (IPMCs) are one of the most preferred choices among biocompatible materials for industrial and biomedical applications. Despite their advantages, some of their drawbacks include non-linear and hysteretic behavior, which complicates the modeling process. In previous works, usually autoregressive models were used to predict the behavior of an IPMC actuator. The main drawback of using an autoregressive model is that it cannot be used in mobile and real-time applications. In this study, we proposed a hybrid analytical intelligent model for an IPMC actuator. The most outstanding feature of this model is its non-autoregressive structure. The hybrid concept proposed in this study can be generalized to various problems other than IPMCs. The structure used in this work comprises an analytical model and a deep neural network, providing a non-linear, dynamic, and non-autoregressive model for the IPMC actuator. Lastly, the average NMSE achieved using the proposed hybrid model is 9.5781e-04 showing a significant drop in the error rate compared to other non-autoregressive structures.
    Faking feature importance: A cautionary tale on the use of differentially-private synthetic data. (arXiv:2203.01363v1 [cs.LG])
    Synthetic datasets are often presented as a silver-bullet solution to the problem of privacy-preserving data publishing. However, for many applications, synthetic data has been shown to have limited utility when used to train predictive models. One promising potential application of these data is in the exploratory phase of the machine learning workflow, which involves understanding, engineering and selecting features. This phase often involves considerable time, and depends on the availability of data. There would be substantial value in synthetic data that permitted these steps to be carried out while, for example, data access was being negotiated, or with fewer information governance restrictions. This paper presents an empirical analysis of the agreement between the feature importance obtained from raw and from synthetic data, on a range of artificially generated and real-world datasets (where feature importance represents how useful each feature is when predicting a the outcome). We employ two differentially-private methods to produce synthetic data, and apply various utility measures to quantify the agreement in feature importance as this varies with the level of privacy. Our results indicate that synthetic data can sometimes preserve several representations of the ranking of feature importance in simple settings but their performance is not consistent and depends upon a number of factors. Particular caution should be exercised in more nuanced real-world settings, where synthetic data can lead to differences in ranked feature importance that could alter key modelling decisions. This work has important implications for developing synthetic versions of highly sensitive data sets in fields such as finance and healthcare.
    AdaFamily: A family of Adam-like adaptive gradient methods. (arXiv:2203.01603v1 [cs.LG])
    We propose AdaFamily, a novel method for training deep neural networks. It is a family of adaptive gradient methods and can be interpreted as sort of a blend of the optimization algorithms Adam, AdaBelief and AdaMomentum. We perform experiments on standard datasets for image classification, demonstrating that our proposed method outperforms these algorithms.
    $\beta$-DARTS: Beta-Decay Regularization for Differentiable Architecture Search. (arXiv:2203.01665v1 [cs.LG])
    Neural Architecture Search~(NAS) has attracted increasingly more attention in recent years because of its capability to design deep neural networks automatically. Among them, differential NAS approaches such as DARTS, have gained popularity for the search efficiency. However, they suffer from two main issues, the weak robustness to the performance collapse and the poor generalization ability of the searched architectures. To solve these two problems, a simple-but-efficient regularization method, termed as Beta-Decay, is proposed to regularize the DARTS-based NAS searching process. Specifically, Beta-Decay regularization can impose constraints to keep the value and variance of activated architecture parameters from too large. Furthermore, we provide in-depth theoretical analysis on how it works and why it works. Experimental results on NAS-Bench-201 show that our proposed method can help to stabilize the searching process and makes the searched network more transferable across different datasets. In addition, our search scheme shows an outstanding property of being less dependent on training time and data. Comprehensive experiments on a variety of search spaces and datasets validate the effectiveness of the proposed method.
    Unfolding-Aided Bootstrapped Phase Retrieval in Optical Imaging. (arXiv:2203.01695v1 [physics.optics])
    Phase retrieval in optical imaging refers to the recovery of a complex signal from phaseless data acquired in the form of its diffraction patterns. These patterns are acquired through a system with a coherent light source that employs a diffractive optical element (DOE) to modulate the scene resulting in coded diffraction patterns at the sensor. Recently, the hybrid approach of model-driven network or deep unfolding has emerged as an effective alternative because it allows for bounding the complexity of phase retrieval algorithms while also retaining their efficacy. Additionally, such hybrid approaches have shown promise in improving the design of DOEs that follow theoretical uniqueness conditions. There are opportunities to exploit novel experimental setups and resolve even more complex DOE phase retrieval applications. This paper presents an overview of algorithms and applications of deep unfolding for bootstrapped - regardless of near, middle, and far zones - phase retrieval.
    Early Time-Series Classification Algorithms: An Empirical Comparison. (arXiv:2203.01628v1 [cs.LG])
    Early Time-Series Classification (ETSC) is the task of predicting the class of incoming time-series by observing as few measurements as possible. Such methods can be employed to obtain classification forecasts in many time-critical applications. However, available techniques are not equally suitable for every problem, since differentiations in the data characteristics can impact algorithm performance in terms of earliness, accuracy, F1-score, and training time. We evaluate six existing ETSC algorithms on publicly available data, as well as on two newly introduced datasets originating from the life sciences and maritime domains. Our goal is to provide a framework for the evaluation and comparison of ETSC algorithms and to obtain intuition on how such approaches perform on real-life applications. The presented framework may also serve as a benchmark for new related techniques.
    Learning Set Functions Under the Optimal Subset Oracle via Equivariant Variational Inference. (arXiv:2203.01693v1 [cs.LG])
    Learning set functions becomes increasingly more important in many applications like product recommendation and compound selection in AI-aided drug discovery. The majority of existing works study methodologies of set function learning under the function value oracle, which, however, requires expensive supervision signals. This renders it impractical for applications with only weak supervisions under the Optimal Subset (OS) oracle, the study of which is surprisingly overlooked. In this work, we present a principled yet practical maximum likelihood learning framework, termed as EquiVSet, that simultaneously meets the following desiderata of learning set functions under the OS oracle: i) permutation invariance of the set mass function being modeled; ii) permission of varying ground set; iii) fully differentiability; iv) minimum prior; and v) scalability. The main components of our framework involve: an energy-based treatment of the set mass function, DeepSet-style architectures to handle permutation invariance, mean-field variational inference, and its amortized variants. Although the framework is embarrassingly simple, empirical studies on three real-world applications (including Amazon product recommendation, set anomaly detection and compound selection for virtual screening) demonstrate that EquiVSet outperforms the baselines by a large margin.
    Fully-Connected Network on Noncompact Symmetric Space and Ridgelet Transform based on Helgason-Fourier Analysis. (arXiv:2203.01631v1 [cs.LG])
    Neural network on Riemannian symmetric space such as hyperbolic space and the manifold of symmetric positive definite (SPD) matrices is an emerging subject of research in geometric deep learning. Based on the well-established framework of the Helgason-Fourier transform on the noncompact symmetric space, we present a fully-connected network and its associated ridgelet transform on the noncompact symmetric space, covering the hyperbolic neural network (HNN) and the SPDNet as special cases. The ridgelet transform is an analysis operator of a depth-2 continuous network spanned by neurons, namely, it maps an arbitrary given function to the weights of a network. Thanks to the coordinate-free reformulation, the role of nonlinear activation functions is revealed to be a wavelet function, and the reconstruction formula directly yields the universality of the proposed networks.
    Improving X-ray Diagnostics through Eye-Tracking and XR. (arXiv:2203.01643v1 [cs.HC])
    There is a growing need to assist radiologists in performing X-ray readings and diagnoses fast, comfortably, and effectively. As radiologists strive to maximize productivity, it is essential to consider the impact of reading rooms in interpreting complex examinations and ensure that higher volume and reporting speeds do not compromise patient outcomes. Virtual Reality (VR) is a disruptive technology for clinical practice in assessing X-ray images. We argue that conjugating eye-tracking with VR devices and Machine Learning may overcome obstacles posed by inadequate ergonomic postures and poor room conditions that often cause erroneous diagnostics when professionals examine digital images.
    Uniform Approximations for Randomized Hadamard Transforms with Applications. (arXiv:2203.01599v1 [cs.LG])
    Randomized Hadamard Transforms (RHTs) have emerged as a computationally efficient alternative to the use of dense unstructured random matrices across a range of domains in computer science and machine learning. For several applications such as dimensionality reduction and compressed sensing, the theoretical guarantees for methods based on RHTs are comparable to approaches using dense random matrices with i.i.d.\ entries. However, several such applications are in the low-dimensional regime where the number of rows sampled from the matrix is rather small. Prior arguments are not applicable to the high-dimensional regime often found in machine learning applications like kernel approximation. Given an ensemble of RHTs with Gaussian diagonals, $\{M^i\}_{i = 1}^m$, and any $1$-Lipschitz function, $f: \mathbb{R} \to \mathbb{R}$, we prove that the average of $f$ over the entries of $\{M^i v\}_{i = 1}^m$ converges to its expectation uniformly over $\| v \| \leq 1$ at a rate comparable to that obtained from using truly Gaussian matrices. We use our inequality to then derive improved guarantees for two applications in the high-dimensional regime: 1) kernel approximation and 2) distance estimation. For kernel approximation, we prove the first \emph{uniform} approximation guarantees for random features constructed through RHTs lending theoretical justification to their empirical success while for distance estimation, our convergence result implies data structures with improved runtime guarantees over previous work by the authors. We believe our general inequality is likely to find use in other applications.
    A Characterization of Multiclass Learnability. (arXiv:2203.01550v1 [cs.LG])
    A seminal result in learning theory characterizes the PAC learnability of binary classes through the Vapnik-Chervonenkis dimension. Extending this characterization to the general multiclass setting has been open since the pioneering works on multiclass PAC learning in the late 1980s. This work resolves this problem: we characterize multiclass PAC learnability through the DS dimension, a combinatorial dimension defined by Daniely and Shalev-Shwartz (2014). The classical characterization of the binary case boils down to empirical risk minimization. In contrast, our characterization of the multiclass case involves a variety of algorithmic ideas; these include a natural setting we call list PAC learning. In the list learning setting, instead of predicting a single outcome for a given unseen input, the goal is to provide a short menu of predictions. Our second main result concerns the Natarajan dimension, which has been a central candidate for characterizing multiclass learnability. This dimension was introduced by Natarajan (1988) as a barrier for PAC learning. Whether the Natarajan dimension characterizes PAC learnability in general has been posed as an open question in several papers since. This work provides a negative answer: we construct a non-learnable class with Natarajan dimension one. For the construction, we identify a fundamental connection between concept classes and topology (i.e., colorful simplicial complexes). We crucially rely on a deep and involved construction of hyperbolic pseudo-manifolds by Januszkiewicz and Swiatkowski. It is interesting that hyperbolicity is directly related to learning problems that are difficult to solve although no obvious barriers exist. This is another demonstration of the fruitful links machine learning has with different areas in mathematics.
    Data Augmentation as Feature Manipulation: a story of desert cows and grass cows. (arXiv:2203.01572v1 [cs.LG])
    Data augmentation is a cornerstone of the machine learning pipeline, yet its theoretical underpinnings remain unclear. Is it merely a way to artificially augment the data set size? Or is it about encouraging the model to satisfy certain invariance? In this work we consider another angle, and we study the effect of data augmentation on the dynamic of the learning process. We find that data augmentation can alter the relative importance of various features, effectively making certain informative but hard to learn features more likely to be captured in the learning process. Importantly, we show that this effect is more pronounced for non-linear models, such as neural networks. Our main contribution is a detailed analysis of data augmentation on the learning dynamic for a two layer convolutional neural network in the recently proposed multi-view model by Allen-Zhu and Li [2020]. We complement this analysis with further experimental evidence that data augmentation can be viewed as a form of feature manipulation.
    Graph Neural Networks for Multimodal Single-Cell Data Integration. (arXiv:2203.01884v1 [cs.LG])
    Recent advances in multimodal single-cell technologies have enabled simultaneous acquisitions of multiple omics data from the same cell, providing deeper insights into cellular states and dynamics. However, it is challenging to learn the joint representations from the multimodal data, model the relationship between modalities, and, more importantly, incorporate the vast amount of single-modality datasets into the downstream analyses. To address these challenges and correspondingly facilitate multimodal single-cell data analyses, three key tasks have been introduced: $\textit{modality prediction}$, $\textit{modality matching}$ and $\textit{joint embedding}$. In this work, we present a general Graph Neural Network framework $\textit{scMoGNN}$ to tackle these three tasks and show that $\textit{scMoGNN}$ demonstrates superior results in all three tasks compared with the state-of-the-art and conventional approaches. Our method is an official winner in the overall ranking of $\textit{modality prediction}$ from $\href{https://openproblems.bio/neurips_2021/}{\textit{NeurIPS 2021 Competition}}$.
    Bayesian Spillover Graphs for Dynamic Networks. (arXiv:2203.01912v1 [stat.ME])
    We present Bayesian Spillover Graphs (BSG), a novel method for learning temporal relationships, identifying critical nodes, and quantifying uncertainty for multi-horizon spillover effects in a dynamic system. BSG leverages both an interpretable framework via forecast error variance decompositions (FEVD) and comprehensive uncertainty quantification via Bayesian time series models to contextualize temporal relationships in terms of systemic risk and prediction variability. Forecast horizon hyperparameter $h$ allows for learning both short-term and equilibrium state network behaviors. Experiments for identifying source and sink nodes under various graph and error specifications show significant performance gains against state-of-the-art Bayesian Networks and deep-learning baselines. Applications to real-world systems also showcase BSG as an exploratory analysis tool for uncovering indirect spillovers and quantifying risk.
    Sign Language Recognition System using TensorFlow Object Detection API. (arXiv:2201.01486v2 [cs.CV] UPDATED)
    Communication is defined as the act of sharing or exchanging information, ideas or feelings. To establish communication between two people, both of them are required to have knowledge and understanding of a common language. But in the case of deaf and dumb people, the means of communication are different. Deaf is the inability to hear and dumb is the inability to speak. They communicate using sign language among themselves and with normal people but normal people do not take seriously the importance of sign language. Not everyone possesses the knowledge and understanding of sign language which makes communication difficult between a normal person and a deaf and dumb person. To overcome this barrier, one can build a model based on machine learning. A model can be trained to recognize different gestures of sign language and translate them into English. This will help a lot of people in communicating and conversing with deaf and dumb people. The existing Indian Sing Language Recognition systems are designed using machine learning algorithms with single and double-handed gestures but they are not real-time. In this paper, we propose a method to create an Indian Sign Language dataset using a webcam and then using transfer learning, train a TensorFlow model to create a real-time Sign Language Recognition system. The system achieves a good level of accuracy even with a limited size dataset.
    Estimating Conditional Average Treatment Effects with Missing Treatment Information. (arXiv:2203.01422v1 [stat.ML])
    Estimating conditional average treatment effects (CATE) is challenging, especially when treatment information is missing. Although this is a widespread problem in practice, CATE estimation with missing treatments has received little attention. In this paper, we analyze CATE estimation in the setting with missing treatments where, thus, unique challenges arise in the form of covariate shifts. We identify two covariate shifts in our setting: (i) a covariate shift between the treated and control population; and (ii) a covariate shift between the observed and missing treatment population. We first theoretically show the effect of these covariate shifts by deriving a generalization bound for estimating CATE in our setting with missing treatments. Then, motivated by our bound, we develop the missing treatment representation network (MTRNet), a novel CATE estimation algorithm that learns a balanced representation of covariates using domain adaptation. By using balanced representations, MTRNet provides more reliable CATE estimates in the covariate domains where the data are not fully observed. In various experiments with semi-synthetic and real-world data, we show that our algorithm improves over the state-of-the-art by a substantial margin.
    Selective Residual M-Net for Real Image Denoising. (arXiv:2203.01645v1 [eess.IV])
    Image restoration is a low-level vision task which is to restore degraded images to noise-free images. With the success of deep neural networks, the convolutional neural networks surpass the traditional restoration methods and become the mainstream in the computer vision area. To advance the performanceof denoising algorithms, we propose a blind real image denoising network (SRMNet) by employing a hierarchical architecture improved from U-Net. Specifically, we use a selective kernel with residual block on the hierarchical structure called M-Net to enrich the multi-scale semantic information. Furthermore, our SRMNet has competitive performance results on two synthetic and two real-world noisy datasets in terms of quantitative metrics and visual quality. The source code and pretrained model are available at https://github.com/TentativeGitHub/SRMNet.
    Investigating the locality of neural network training dynamics. (arXiv:2111.01166v2 [cs.LG] UPDATED)
    In the recent past a certain property of neural training trajectories in weight-space had been isolated, that of "local elasticity" ($\srel$) - which attempts to quantify the propagation of influence of a sampled data point on the prediction at another data point. In this work, we embark on a comprehensive study of local elasticity. Firstly, specific to the classification setting, we suggest a new definition of the original idea of $\srel$. Via experiments on state-of-the-art neural networks training on SVHN, CIFAR-10 and CIFAR-100 we demonstrate how our new $\srel$ detects the property of the weight updates preferring to make changes in predictions within the same class as of the sampled data. Next, we demonstrate via examples of neural regression that the original $\srel$ reveals a $2-$phase behavior: that their training proceeds via an initial elastic phase when $\srel$ changes rapidly and an eventual inelastic phase when $\srel$ remains large. Lastly, we give multiple examples of learning via gradient flows for which one can get a closed-form expression of the original $\srel$ function. By studying the plots of these derived formulas we give theoretical demonstrations of some of the experimentally detected properties of $\srel$ in the regression setting.
    Neural Graph Matching for Pre-training Graph Neural Networks. (arXiv:2203.01597v1 [cs.LG])
    Recently, graph neural networks (GNNs) have been shown powerful capacity at modeling structural data. However, when adapted to downstream tasks, it usually requires abundant task-specific labeled data, which can be extremely scarce in practice. A promising solution to data scarcity is to pre-train a transferable and expressive GNN model on large amounts of unlabeled graphs or coarse-grained labeled graphs. Then the pre-trained GNN is fine-tuned on downstream datasets with task-specific fine-grained labels. In this paper, we present a novel Graph Matching based GNN Pre-Training framework, called GMPT. Focusing on a pair of graphs, we propose to learn structural correspondences between them via neural graph matching, consisting of both intra-graph message passing and inter-graph message passing. In this way, we can learn adaptive representations for a given graph when paired with different graphs, and both node- and graph-level characteristics are naturally considered in a single pre-training task. The proposed method can be applied to fully self-supervised pre-training and coarse-grained supervised pre-training. We further propose an approximate contrastive training strategy to significantly reduce time/memory consumption. Extensive experiments on multi-domain, out-of-distribution benchmarks have demonstrated the effectiveness of our approach. The code is available at: https://github.com/RUCAIBox/GMPT.
    A shallow physics-informed neural network for solving partial differential equations on surfaces. (arXiv:2203.01581v1 [math.NA])
    In this paper, we introduce a mesh-free physics-informed neural network for solving partial differential equations on surfaces. Based on the idea of embedding techniques, we write the underlying surface differential equations using conventional Cartesian differential operators. With the aid of level set function, the surface geometrical quantities, such as the normal and mean curvature of the surface, can be computed directly and used in our surface differential expressions. So instead of imposing the normal extension constraints used in literature, we take the whole Cartesian differential expressions into account in our loss function. Meanwhile, we adopt a completely shallow (one hidden layer) network so the present model is easy to implement and train. We perform a series of numerical experiments on both stationary and time-dependent partial differential equations on complicated surface geometries. The result shows that, with just a few hundred trainable parameters, our network model is able to achieve high predictive accuracy.
    Closed-form Continuous-time Neural Models. (arXiv:2106.13898v2 [cs.LG] UPDATED)
    Continuous-time neural processes are performant sequential decision-makers that are built by differential equations (DE). However, their expressive power when they are deployed on computers is bottlenecked by numerical DE solvers. This limitation has significantly slowed down the scaling and understanding of numerous natural physical phenomena such as the dynamics of nervous systems. Ideally, we would circumvent this bottleneck by solving the given dynamical system in closed form. This is known to be intractable in general. Here, we show it is possible to closely approximate the interaction between neurons and synapses -- the building blocks of natural and artificial neural networks -- constructed by liquid time-constant networks (LTCs) efficiently in closed-form. To this end, we compute a tightly-bounded approximation of the solution of an integral appearing in LTCs' dynamics, that has had no known closed-form solution so far. This closed-form solution substantially impacts the design of continuous-time and continuous-depth neural models; for instance, since time appears explicitly in closed-form, the formulation relaxes the need for complex numerical solvers. Consequently, we obtain models that are between one and five orders of magnitude faster in training and inference compared to differential equation-based counterparts. More importantly, in contrast to ODE-based continuous networks, closed-form networks can scale remarkably well compared to other deep learning instances. Lastly, as these models are derived from liquid networks, they show remarkable performance in time series modeling, compared to advanced recurrent models.
    Multi-Agent Variational Occlusion Inference Using People as Sensors. (arXiv:2109.02173v3 [cs.RO] UPDATED)
    Autonomous vehicles must reason about spatial occlusions in urban environments to ensure safety without being overly cautious. Prior work explored occlusion inference from observed social behaviors of road agents, hence treating people as sensors. Inferring occupancy from agent behaviors is an inherently multimodal problem; a driver may behave similarly for different occupancy patterns ahead of them (e.g., a driver may move at constant speed in traffic or on an open road). Past work, however, does not account for this multimodality, thus neglecting to model this source of aleatoric uncertainty in the relationship between driver behaviors and their environment. We propose an occlusion inference method that characterizes observed behaviors of human agents as sensor measurements, and fuses them with those from a standard sensor suite. To capture the aleatoric uncertainty, we train a conditional variational autoencoder with a discrete latent space to learn a multimodal mapping from observed driver trajectories to an occupancy grid representation of the view ahead of the driver. Our method handles multi-agent scenarios, combining measurements from multiple observed drivers using evidential theory to solve the sensor fusion problem. Our approach is validated on a cluttered, real-world intersection, outperforming baselines and demonstrating real-time capable performance. Our code is available at https://github.com/sisl/MultiAgentVariationalOcclusionInference .
    Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD. (arXiv:2105.08023v2 [math.OC] UPDATED)
    We consider the decentralized stochastic optimization problems, where a network of $n$ nodes, each owning a local cost function, cooperate to find a minimizer of the globally-averaged cost. A widely studied decentralized algorithm for this problem is decentralized SGD (D-SGD), in which each node averages only with its neighbors. D-SGD is efficient in single-iteration communication, but it is very sensitive to the network topology. For smooth objective functions, the transient stage (which measures the number of iterations the algorithm has to experience before achieving the linear speedup stage) of D-SGD is on the order of ${\Omega}(n/(1-\beta)^2)$ and $\Omega(n^3/(1-\beta)^4)$ for strongly and generally convex cost functions, respectively, where $1-\beta \in (0,1)$ is a topology-dependent quantity that approaches $0$ for a large and sparse network. Hence, D-SGD suffers from slow convergence for large and sparse networks. In this work, we study the non-asymptotic convergence property of the D$^2$/Exact-diffusion algorithm. By eliminating the influence of data heterogeneity between nodes, D$^2$/Exact-diffusion is shown to have an enhanced transient stage that is on the order of $\tilde{\Omega}(n/(1-\beta))$ and $\Omega(n^3/(1-\beta)^2)$ for strongly and generally convex cost functions, respectively. Moreover, when D$^2$/Exact-diffusion is implemented with gradient accumulation and multi-round gossip communications, its transient stage can be further improved to $\tilde{\Omega}(1/(1-\beta)^{\frac{1}{2}})$ and $\tilde{\Omega}(n/(1-\beta))$ for strongly and generally convex cost functions, respectively. These established results for D$^2$/Exact-Diffusion have the best (i.e., weakest) dependence on network topology to our knowledge compared to existing decentralized algorithms. We also conduct numerical simulations to validate our theories.
    Reconstruction of univariate functions from directional persistence diagrams. (arXiv:2203.01894v1 [math.AT])
    We describe a method for approximating a single-variable function $f$ using persistence diagrams of sublevel sets of $f$ from height functions in different directions. We provide algorithms for the piecewise linear case and for the smooth case. Three directions suffice to locate all local maxima and minima of a piecewise linear continuous function from its collection of directional persistence diagrams, while five directions are needed in the case of smooth functions with non-degenerate critical points. Our approximation of functions by means of persistence diagrams is motivated by a study of importance attribution in machine learning, where one seeks to reduce the number of critical points of signal functions without a significant loss of information for a neural network classifier.
    CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation. (arXiv:2203.01929v1 [cs.CV])
    This paper studies the complex task of simultaneous multi-object 3D reconstruction, 6D pose and size estimation from a single-view RGB-D observation. In contrast to instance-level pose estimation, we focus on a more challenging problem where CAD models are not available at inference time. Existing approaches mainly follow a complex multi-stage pipeline which first localizes and detects each object instance in the image and then regresses to either their 3D meshes or 6D poses. These approaches suffer from high-computational cost and low performance in complex multi-object scenarios, where occlusions can be present. Hence, we present a simple one-stage approach to predict both the 3D shape and estimate the 6D pose and size jointly in a bounding-box free manner. In particular, our method treats object instances as spatial centers where each center denotes the complete shape of an object along with its 6D pose and size. Through this per-pixel representation, our approach can reconstruct in real-time (40 FPS) multiple novel object instances and predict their 6D pose and sizes in a single-forward pass. Through extensive experiments, we demonstrate that our approach significantly outperforms all shape completion and categorical 6D pose and size estimation baselines on multi-object ShapeNet and NOCS datasets respectively with a 12.6% absolute improvement in mAP for 6D pose for novel real-world object instances.
    Large-scale Optimization of Partial AUC in a Range of False Positive Rates. (arXiv:2203.01505v1 [cs.LG])
    The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning. However, it summarizes the true positive rates (TPRs) over all false positive rates (FPRs) in the ROC space, which may include the FPRs with no practical relevance in some applications. The partial AUC, as a generalization of the AUC, summarizes only the TPRs over a specific range of the FPRs and is thus a more suitable performance measure in many real-world situations. Although partial AUC optimization in a range of FPRs had been studied, existing algorithms are not scalable to big data and not applicable to deep learning. To address this challenge, we cast the problem into a non-smooth difference-of-convex (DC) program for any smooth predictive functions (e.g., deep neural networks), which allowed us to develop an efficient approximated gradient descent method based on the Moreau envelope smoothing technique, inspired by recent advances in non-smooth DC optimization. To increase the efficiency of large data processing, we used an efficient stochastic block coordinate update in our algorithm. Our proposed algorithm can also be used to minimize the sum of ranked range loss, which also lacks efficient solvers. We established a complexity of $\tilde O(1/\epsilon^6)$ for finding a nearly $\epsilon$-critical solution. Finally, we numerically demonstrated the effectiveness of our proposed algorithms for both partial AUC maximization and sum of ranked range loss minimization.
    Label-Only Model Inversion Attacks via Boundary Repulsion. (arXiv:2203.01925v1 [cs.LG])
    Recent studies show that the state-of-the-art deep neural networks are vulnerable to model inversion attacks, in which access to a model is abused to reconstruct private training data of any given target class. Existing attacks rely on having access to either the complete target model (whitebox) or the model's soft-labels (blackbox). However, no prior work has been done in the harder but more practical scenario, in which the attacker only has access to the model's predicted label, without a confidence measure. In this paper, we introduce an algorithm, Boundary-Repelling Model Inversion (BREP-MI), to invert private training data using only the target model's predicted labels. The key idea of our algorithm is to evaluate the model's predicted labels over a sphere and then estimate the direction to reach the target class's centroid. Using the example of face recognition, we show that the images reconstructed by BREP-MI successfully reproduce the semantics of the private training data for various datasets and target model architectures. We compare BREP-MI with the state-of-the-art whitebox and blackbox model inversion attacks and the results show that despite assuming less knowledge about the target model, BREP-MI outperforms the blackbox attack and achieves comparable results to the whitebox attack.
    Thermodynamics-informed graph neural networks. (arXiv:2203.01874v1 [cs.LG])
    In this paper we present a deep learning method to predict the time evolution of dissipative dynamical systems. We propose using both geometric and thermodynamic inductive biases to improve accuracy and generalization of the resulting integration scheme. The first is achieved with Graph Neural Networks, which induces a non-Euclidean geometrical prior and permutation invariant node and edge update functions. The second bias is forced by learning the GENERIC structure of the problem, an extension of the Hamiltonian formalism, to model more general non-conservative dynamics. Several examples are provided in both Eulerian and Lagrangian description in the context of fluid and solid mechanics respectively.
    Robust PAC$^m$: Training Ensemble Models Under Model Misspecification and Outliers. (arXiv:2203.01859v1 [cs.LG])
    Standard Bayesian learning is known to have suboptimal generalization capabilities under model misspecification and in the presence of outliers. PAC-Bayes theory demonstrates that the free energy criterion minimized by Bayesian learning is a bound on the generalization error for Gibbs predictors (i.e., for single models drawn at random from the posterior) under the assumption of sampling distributions uncontaminated by outliers. This viewpoint provides a justification for the limitations of Bayesian learning when the model is misspecified, requiring ensembling, and when data is affected by outliers. In recent work, PAC-Bayes bounds - referred to as PAC$^m$ - were derived to introduce free energy metrics that account for the performance of ensemble predictors, obtaining enhanced performance under misspecification. This work presents a novel robust free energy criterion that combines the generalized logarithm score function with PAC$^m$ ensemble bounds. The proposed free energy training criterion produces predictive distributions that are able to concurrently counteract the detrimental effects of model misspecification and outliers.
    Socially Aware Robot Crowd Navigation with Interaction Graphs and Human Trajectory Prediction. (arXiv:2203.01821v1 [cs.RO])
    We study the problem of safe and socially aware robot navigation in dense and interactive human crowds. Previous works use simplified methods to model the personal spaces of pedestrians and ignore the social compliance of the robot behaviors. In this paper, we provide a more accurate representation of personal zones of walking pedestrians with their future trajectories. The predicted personal zones are incorporated into a reinforcement learning framework to prevent the robot from intruding into the personal zones. To learn socially aware navigation policies, we propose a novel recurrent graph neural network with attention mechanisms to capture the interactions among agents through space and time. We demonstrate that our method enables the robot to achieve good navigation performance and non-invasiveness in challenging crowd navigation scenarios. We successfully transfer the policy learned in the simulator to a real-world TurtleBot 2i.
    On Learning Contrastive Representations for Learning with Noisy Labels. (arXiv:2203.01785v1 [cs.LG])
    Deep neural networks are able to memorize noisy labels easily with a softmax cross-entropy (CE) loss. Previous studies attempted to address this issue focus on incorporating a noise-robust loss function to the CE loss. However, the memorization issue is alleviated but still remains due to the non-robust CE loss. To address this issue, we focus on learning robust contrastive representations of data on which the classifier is hard to memorize the label noise under the CE loss. We propose a novel contrastive regularization function to learn such representations over noisy data where label noise does not dominate the representation learning. By theoretically investigating the representations induced by the proposed regularization function, we reveal that the learned representations keep information related to true labels and discard information related to corrupted labels. Moreover, our theoretical results also indicate that the learned representations are robust to the label noise. The effectiveness of this method is demonstrated with experiments on benchmark datasets.
    Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings. (arXiv:2203.01570v1 [cs.LG])
    A topic model is often formulated as a generative model that explains how each word of a document is generated given a set of topics and document-specific topic proportions. It is focused on capturing the word co-occurrences in a document and hence often suffers from poor performance in analyzing short documents. In addition, its parameter estimation often relies on approximate posterior inference that is either not scalable or suffers from large approximation error. This paper introduces a new topic-modeling framework where each document is viewed as a set of word embedding vectors and each topic is modeled as an embedding vector in the same embedding space. Embedding the words and topics in the same vector space, we define a method to measure the semantic difference between the embedding vectors of the words of a document and these of the topics, and optimize the topic embeddings to minimize the expected difference over all documents. Experiments on text analysis demonstrate that the proposed method, which is amenable to mini-batch stochastic gradient descent based optimization and hence scalable to big corpora, provides competitive performance in discovering more coherent and diverse topics and extracting better document representations.
    Private High-Dimensional Hypothesis Testing. (arXiv:2203.01537v1 [cs.DS])
    We provide improved differentially private algorithms for identity testing of high-dimensional distributions. Specifically, for $d$-dimensional Gaussian distributions with known covariance $\Sigma$, we can test whether the distribution comes from $\mathcal{N}(\mu^*, \Sigma)$ for some fixed $\mu^*$ or from some $\mathcal{N}(\mu, \Sigma)$ with total variation distance at least $\alpha$ from $\mathcal{N}(\mu^*, \Sigma)$ with $(\varepsilon, 0)$-differential privacy, using only \[\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/3}}{\alpha^{4/3} \cdot \varepsilon^{2/3}} + \frac{1}{\alpha \cdot \varepsilon}\right)\] samples if the algorithm is allowed to be computationally inefficient, and only \[\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/4}}{\alpha \cdot \varepsilon}\right)\] samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over $\{\pm 1\}^d$, and tolerant testing. Our results improve over the previous best work of Canonne, Kamath, McMillan, Ullman, and Zakynthinou \cite{CanonneKMUZ20} for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal \emph{non-private} sample complexity of $O\left(\frac{\sqrt{d}}{\alpha^2}\right)$ in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of $d$-dimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size $d$ \cite{AcharyaSZ18}, which refutes a conjectured lower bound of Canonne et al. \cite{CanonneKMUZ20}.
    Weakly Supervised Object Localization as Domain Adaption. (arXiv:2203.01714v1 [cs.CV])
    Weakly supervised object localization (WSOL) focuses on localizing objects only with the supervision of image-level classification masks. Most previous WSOL methods follow the classification activation map (CAM) that localizes objects based on the classification structure with the multi-instance learning (MIL) mechanism. However, the MIL mechanism makes CAM only activate discriminative object parts rather than the whole object, weakening its performance for localizing objects. To avoid this problem, this work provides a novel perspective that models WSOL as a domain adaption (DA) task, where the score estimator trained on the source/image domain is tested on the target/pixel domain to locate objects. Under this perspective, a DA-WSOL pipeline is designed to better engage DA approaches into WSOL to enhance localization performance. It utilizes a proposed target sampling strategy to select different types of target samples. Based on these types of target samples, domain adaption localization (DAL) loss is elaborated. It aligns the feature distribution between the two domains by DA and makes the estimator perceive target domain cues by Universum regularization. Experiments show that our pipeline outperforms SOTA methods on multi benchmarks. Code are released at \url{https://github.com/zh460045050/DA-WSOL_CVPR2022}.
    Label Leakage and Protection from Forward Embedding in Vertical Federated Learning. (arXiv:2203.01451v1 [cs.LG])
    Vertical federated learning (vFL) has gained much attention and been deployed to solve machine learning problems with data privacy concerns in recent years. However, some recent work demonstrated that vFL is vulnerable to privacy leakage even though only the forward intermediate embedding (rather than raw features) and backpropagated gradients (rather than raw labels) are communicated between the involved participants. As the raw labels often contain highly sensitive information, some recent work has been proposed to prevent the label leakage from the backpropagated gradients effectively in vFL. However, these work only identified and defended the threat of label leakage from the backpropagated gradients. None of these work has paid attention to the problem of label leakage from the intermediate embedding. In this paper, we propose a practical label inference method which can steal private labels effectively from the shared intermediate embedding even though some existing protection methods such as label differential privacy and gradients perturbation are applied. The effectiveness of the label attack is inseparable from the correlation between the intermediate embedding and corresponding private labels. To mitigate the issue of label leakage from the forward embedding, we add an additional optimization goal at the label party to limit the label stealing ability of the adversary by minimizing the distance correlation between the intermediate embedding and corresponding private labels. We conducted massive experiments to demonstrate the effectiveness of our proposed protection methods.
    On an application of graph neural networks in population based SHM. (arXiv:2203.01646v1 [cs.LG])
    Attempts have been made recently in the field of population-based structural health monitoring (PBSHM), to transfer knowledge between SHM models of different structures. The attempts have been focussed on homogeneous and heterogeneous populations. A more general approach to transferring knowledge between structures, is by considering all plausible structures as points on a multidimensional base manifold and building a fibre bundle. The idea is quite powerful, since, a mapping between points in the base manifold and their fibres, the potential states of any arbitrary structure, can be learnt. A smaller scale problem, but still useful, is that of learning a specific point of every fibre, i.e. that corresponding to the undamaged state of structures within a population. Under the framework of PBSHM, a data-driven approach to the aforementioned problem is developed. Structures are converted into graphs and inference is attempted within a population, using a graph neural network (GNN) algorithm. The algorithm solves a major problem existing in such applications. Structures comprise different sizes and are defined as abstract objects, thus attempting to perform inference within a heterogeneous population is not trivial. The proposed approach is tested in a simulated population of trusses. The goal of the application is to predict the first natural frequency of trusses of different sizes, across different environmental temperatures and having different bar member types. After training the GNN using part of the total population, it was tested on trusses that were not included in the training dataset. Results show that the accuracy of the regression is satisfactory even in structures with higher number of nodes and members than those used to train it.
    Enhancing Adversarial Robustness for Deep Metric Learning. (arXiv:2203.01439v1 [cs.LG])
    Owing to security implications of adversarial vulnerability, adversarial robustness of deep metric learning models has to be improved. In order to avoid model collapse due to excessively hard examples, the existing defenses dismiss the min-max adversarial training, but instead learn from a weak adversary inefficiently. Conversely, we propose Hardness Manipulation to efficiently perturb the training triplet till a specified level of hardness for adversarial training, according to a harder benign triplet or a pseudo-hardness function. It is flexible since regular training and min-max adversarial training are its boundary cases. Besides, Gradual Adversary, a family of pseudo-hardness functions is proposed to gradually increase the specified hardness level during training for a better balance between performance and robustness. Additionally, an Intra-Class Structure loss term among benign and adversarial examples further improves model robustness and efficiency. Comprehensive experimental results suggest that the proposed method, although simple in its form, overwhelmingly outperforms the state-of-the-art defenses in terms of robustness, training efficiency, as well as performance on benign examples.
    Approximate Bayesian Computation Based on Maxima Weighted Isolation Kernel Mapping. (arXiv:2201.12745v3 [stat.ML] UPDATED)
    Motivation: A branching processes model yields an unevenly stochastically distributed dataset that consists of sparse and dense regions. This work addresses the problem of precisely evaluating parameters for such a model. Applying a branching processes model to an area such as cancer cell evolution faces a number of obstacles, including high dimensionality and the rare appearance of a result of interest. We take on the ambitious task of obtaining the coefficients of a model that reflects the relationship of driver gene mutations and cancer hallmarks on the basis of personal data regarding variant allele frequencies. Results: An approximate Bayesian computation method based on Isolation Kernel is developed. The method involves the transformation of row data to a Hilbert space (mapping) and the measurement of the similarity between simulated points and maxima weighted Isolation Kernel mapping related to the observation point. We also design a heuristic algorithm for parameter estimation that requires no calculation and is dimension independent. The advantages of the proposed machine learning method are illustrated using multidimensional test data as well as a specific example focused on cancer cell evolution.
    BERT Learns to Teach: Knowledge Distillation with Meta Learning. (arXiv:2106.04570v2 [cs.LG] UPDATED)
    We present Knowledge Distillation with Meta Learning (MetaDistil), a simple yet effective alternative to traditional knowledge distillation (KD) methods where the teacher model is fixed during training. We show the teacher network can learn to better transfer knowledge to the student network (i.e., learning to teach) with the feedback from the performance of the distilled student network in a meta learning framework. Moreover, we introduce a pilot update mechanism to improve the alignment between the inner-learner and meta-learner in meta learning algorithms that focus on an improved inner-learner. Experiments on various benchmarks show that MetaDistil can yield significant improvements compared with traditional KD algorithms and is less sensitive to the choice of different student capacity and hyperparameters, facilitating the use of KD on different tasks and models.
    PetsGAN: Rethinking Priors for Single Image Generation. (arXiv:2203.01488v1 [cs.CV])
    Single image generation (SIG), described as generating diverse samples that have similar visual content with the given single image, is first introduced by SinGAN which builds a pyramid of GANs to progressively learn the internal patch distribution of the single image. It also shows great potentials in a wide range of image manipulation tasks. However, the paradigm of SinGAN has limitations in terms of generation quality and training time. Firstly, due to the lack of high-level information, SinGAN cannot handle the object images well as it does on the scene and texture images. Secondly, the separate progressive training scheme is time-consuming and easy to cause artifact accumulation. To tackle these problems, in this paper, we dig into the SIG problem and improve SinGAN by fully-utilization of internal and external priors. The main contributions of this paper include: 1) We introduce to SIG a regularized latent variable model. To the best of our knowledge, it is the first time to give a clear formulation and optimization goal of SIG, and all the existing methods for SIG can be regarded as special cases of this model. 2) We design a novel Prior-based end-to-end training GAN (PetsGAN) to overcome the problems of SinGAN. Our method gets rid of the time-consuming progressive training scheme and can be trained end-to-end. 3) We construct abundant qualitative and quantitative experiments to show the superiority of our method on both generated image quality, diversity, and the training speed. Moreover, we apply our method to other image manipulation tasks (e.g., style transfer, harmonization), and the results further prove the effectiveness and efficiency of our method.  ( 2 min )
    Colon Nuclei Instance Segmentation using a Probabilistic Two-Stage Detector. (arXiv:2203.01321v1 [eess.IV])
    Cancer is one of the leading causes of death in the developed world. Cancer diagnosis is performed through the microscopic analysis of a sample of suspicious tissue. This process is time consuming and error prone, but Deep Learning models could be helpful for pathologists during cancer diagnosis. We propose to change the CenterNet2 object detection model to also perform instance segmentation, which we call SegCenterNet2. We train SegCenterNet2 in the CoNIC challenge dataset and show that it performs better than Mask R-CNN in the competition metrics.  ( 2 min )
    Correct-N-Contrast: A Contrastive Approach for Improving Robustness to Spurious Correlations. (arXiv:2203.01517v1 [cs.LG])
    Spurious correlations pose a major challenge for robust machine learning. Models trained with empirical risk minimization (ERM) may learn to rely on correlations between class labels and spurious attributes, leading to poor performance on data groups without these correlations. This is particularly challenging to address when spurious attribute labels are unavailable. To improve worst-group performance on spuriously correlated data without training attribute labels, we propose Correct-N-Contrast (CNC), a contrastive approach to directly learn representations robust to spurious correlations. As ERM models can be good spurious attribute predictors, CNC works by (1) using a trained ERM model's outputs to identify samples with the same class but dissimilar spurious features, and (2) training a robust model with contrastive learning to learn similar representations for same-class samples. To support CNC, we introduce new connections between worst-group error and a representation alignment loss that CNC aims to minimize. We empirically observe that worst-group error closely tracks with alignment loss, and prove that the alignment loss over a class helps upper-bound the class's worst-group vs. average error gap. On popular benchmarks, CNC reduces alignment loss drastically, and achieves state-of-the-art worst-group accuracy by 3.6% average absolute lift. CNC is also competitive with oracle methods that require group labels.  ( 2 min )
    Adaptive Gradient Methods with Local Guarantees. (arXiv:2203.01400v1 [cs.LG])
    Adaptive gradient methods are the method of choice for optimization in machine learning and used to train the largest deep models. In this paper we study the problem of learning a local preconditioner, that can change as the data is changing along the optimization trajectory. We propose an adaptive gradient method that has provable adaptive regret guarantees vs. the best local preconditioner. To derive this guarantee, we prove a new adaptive regret bound in online learning that improves upon previous adaptive online learning methods. We demonstrate the robustness of our method in automatically choosing the optimal learning rate schedule for popular benchmarking tasks in vision and language domains. Without the need to manually tune a learning rate schedule, our method can, in a single run, achieve comparable and stable task accuracy as a fine-tuned optimizer.  ( 2 min )
    Stable and Semi-stable Sampling Approaches for Continuously Used Samples. (arXiv:2203.01381v1 [cs.IR])
    Information retrieval systems are usually measured by labeling the relevance of results corresponding to a sample of user queries. In practical search engines, such measurement needs to be performed continuously, such as daily or weekly. This creates a trade-off between (a) representativeness of query sample to current query traffic of the product; (b) labeling cost: if we keep the same query sample, results would be similar allowing us to reuse their labels; and (c) overfitting caused by continuous usage of same query sample. In this paper we explicitly formulate this tradeoff, propose two new variants -- Stable and Semi-stable -- to simple and weighted random sampling and show that they outperform existing approaches for the continuous usage settings, including monitoring/debugging search engine or comparing ranker candidates.  ( 2 min )
    MetaDT: Meta Decision Tree for Interpretable Few-Shot Learning. (arXiv:2203.01482v1 [cs.LG])
    Few-Shot Learning (FSL) is a challenging task, which aims to recognize novel classes with few examples. Recently, lots of methods have been proposed from the perspective of meta-learning and representation learning for improving FSL performance. However, few works focus on the interpretability of FSL decision process. In this paper, we take a step towards the interpretable FSL by proposing a novel decision tree-based meta-learning framework, namely, MetaDT. Our insight is replacing the last black-box FSL classifier of the existing representation learning methods by an interpretable decision tree with meta-learning. The key challenge is how to effectively learn the decision tree (i.e., the tree structure and the parameters of each node) in the FSL setting. To address the challenge, we introduce a tree-like class hierarchy as our prior: 1) the hierarchy is directly employed as the tree structure; 2) by regarding the class hierarchy as an undirected graph, a graph convolution-based decision tree inference network is designed as our meta-learner to learn to infer the parameters of each node. At last, a two-loop optimization mechanism is incorporated into our framework for a fast adaptation of the decision tree with few examples. Extensive experiments on performance comparison and interpretability analysis show the effectiveness and superiority of our MetaDT. Our code will be publicly available upon acceptance.  ( 2 min )
    Scalable Bayesian Optimization Using Vecchia Approximations of Gaussian Processes. (arXiv:2203.01459v1 [cs.LG])
    Bayesian optimization is a technique for optimizing black-box target functions. At the core of Bayesian optimization is a surrogate model that predicts the output of the target function at previously unseen inputs to facilitate the selection of promising input values. Gaussian processes (GPs) are commonly used as surrogate models but are known to scale poorly with the number of observations. We adapt the Vecchia approximation, a popular GP approximation from spatial statistics, to enable scalable high-dimensional Bayesian optimization. We develop several improvements and extensions, including training warped GPs using mini-batch gradient descent, approximate neighbor search, and selecting multiple input values in parallel. We focus on the use of our warped Vecchia GP in trust-region Bayesian optimization via Thompson sampling. On several test functions and on two reinforcement-learning problems, our methods compared favorably to the state of the art.  ( 2 min )
    Hyperspectral Pixel Unmixing with Latent Dirichlet Variational Autoencoder. (arXiv:2203.01327v1 [eess.IV])
    Hyperspectral pixel intensities result from a mixing of reflectances from several materials. This paper develops a method of hyperspectral pixel {\it unmixing} that aims to recover the "pure" spectral signal of each material (hereafter referred to as {\it endmembers}) together with the mixing ratios ({\it abundances}) given the spectrum of a single pixel. The unmixing problem is particularly relevant in the case of low-resolution hyperspectral images captured in a remote sensing setting, where individual pixels can cover large regions of the scene. Under the assumptions that (1) a multivariate Normal distribution can represent the spectra of an endmember and (2) a Dirichlet distribution can encode abundances of different endmembers, we develop a Latent Dirichlet Variational Autoencoder for hyperspectral pixel unmixing. Our approach achieves state-of-the-art results on standard benchmarks and on synthetic data generated using United States Geological Survey spectral library.  ( 2 min )
    Conditional Reconstruction for Open-set Semantic Segmentation. (arXiv:2203.01368v1 [cs.CV])
    Open set segmentation is a relatively new and unexploredtask, with just a handful of methods proposed to model suchtasks.We propose a novel method called CoReSeg thattackles the issue using class conditional reconstruction ofthe input images according to their pixelwise mask. Ourmethod conditions each input pixel to all known classes,expecting higher errors for pixels of unknown classes. Itwas observed that the proposed method produces better se-mantic consistency in its predictions, resulting in cleanersegmentation maps that better fit object boundaries. CoRe-Seg outperforms state-of-the-art methods on the Vaihin-gen and Potsdam ISPRS datasets, while also being com-petitive on the Houston 2018 IEEE GRSS Data Fusiondataset. Official implementation for CoReSeg is availableat:https://github.com/iannunes/CoReSeg.  ( 2 min )
    Privacy-Aware Crowd Labelling for Machine Learning Tasks. (arXiv:2203.01373v1 [cs.HC])
    The extensive use of online social media has highlighted the importance of privacy in the digital space. As more scientists analyse the data created in these platforms, privacy concerns have extended to data usage within the academia. Although text analysis is a well documented topic in academic literature with a multitude of applications, ensuring privacy of user-generated content has been overlooked. Most sentiment analysis methods require emotion labels, which can be obtained through crowdsourcing, where non-expert individuals contribute to scientific tasks. The text itself has to be exposed to third parties in order to be labelled. In an effort to reduce the exposure of online users' information, we propose a privacy preserving text labelling method for varying applications, based in crowdsourcing. We transform text with different levels of privacy, and analyse the effectiveness of the transformation with regards to label correlation and consistency. Our results suggest that privacy can be implemented in labelling, retaining the annotational diversity and subjectivity of traditional labelling.  ( 2 min )
    Fairness-aware Adversarial Perturbation Towards Bias Mitigation for Deployed Deep Models. (arXiv:2203.01584v1 [cs.LG])
    Prioritizing fairness is of central importance in artificial intelligence (AI) systems, especially for those societal applications, e.g., hiring systems should recommend applicants equally from different demographic groups, and risk assessment systems must eliminate racism in criminal justice. Existing efforts towards the ethical development of AI systems have leveraged data science to mitigate biases in the training set or introduced fairness principles into the training process. For a deployed AI system, however, it may not allow for retraining or tuning in practice. By contrast, we propose a more flexible approach, i.e., fairness-aware adversarial perturbation (FAAP), which learns to perturb input data to blind deployed models on fairness-related features, e.g., gender and ethnicity. The key advantage is that FAAP does not modify deployed models in terms of parameters and structures. To achieve this, we design a discriminator to distinguish fairness-related attributes based on latent representations from deployed models. Meanwhile, a perturbation generator is trained against the discriminator, such that no fairness-related features could be extracted from perturbed inputs. Exhaustive experimental evaluation demonstrates the effectiveness and superior performance of the proposed FAAP. In addition, FAAP is validated on real-world commercial deployments (inaccessible to model parameters), which shows the transferability of FAAP, foreseeing the potential of black-box adaptation.  ( 2 min )
    3D Common Corruptions and Data Augmentation. (arXiv:2203.01441v1 [cs.CV])
    We introduce a set of image transformations that can be used as `corruptions' to evaluate the robustness of models as well as `data augmentation' mechanisms for training neural networks. The primary distinction of the proposed transformations is that, unlike existing approaches such as Common Corruptions, the geometry of the scene is incorporated in the transformations -- thus leading to corruptions that are more likely to occur in the real world. We show these transformations are `efficient' (can be computed on-the-fly), `extendable' (can be applied on most datasets of real images), expose vulnerability of existing models, and can effectively make models more robust when employed as `3D data augmentation' mechanisms. Our evaluations performed on several tasks and datasets suggest incorporating 3D information into robustness benchmarking and training opens up a promising direction for robustness research.  ( 2 min )
    Skew-Symmetric Adjacency Matrices for Clustering Directed Graphs. (arXiv:2203.01388v1 [cs.LG])
    Cut-based directed graph (digraph) clustering often focuses on finding dense within-cluster or sparse between-cluster connections, similar to cut-based undirected graph clustering methods. In contrast, for flow-based clusterings the edges between clusters tend to be oriented in one direction and have been found in migration data, food webs, and trade data. In this paper we introduce a spectral algorithm for finding flow-based clusterings. The proposed algorithm is based on recent work which uses complex-valued Hermitian matrices to represent digraphs. By establishing an algebraic relationship between a complex-valued Hermitian representation and an associated real-valued, skew-symmetric matrix the proposed algorithm produces clusterings while remaining completely in the real field. Our algorithm uses less memory and asymptotically less computation while provably preserving solution quality. We also show the algorithm can be easily implemented using standard computational building blocks, possesses better numerical properties, and loans itself to a natural interpretation via an objective function relaxation argument.  ( 2 min )
    QaNER: Prompting Question Answering Models for Few-shot Named Entity Recognition. (arXiv:2203.01543v1 [cs.CL])
    Recently, prompt-based learning for pre-trained language models has succeeded in few-shot Named Entity Recognition (NER) by exploiting prompts as task guidance to increase label efficiency. However, previous prompt-based methods for few-shot NER have limitations such as a higher computational complexity, poor zero-shot ability, requiring manual prompt engineering, or lack of prompt robustness. In this work, we address these shortcomings by proposing a new prompt-based learning NER method with Question Answering (QA), called QaNER. Our approach includes 1) a refined strategy for converting NER problems into the QA formulation; 2) NER prompt generation for QA models; 3) prompt-based tuning with QA models on a few annotated NER examples; 4) zero-shot NER by prompting the QA model. Comparing the proposed approach with previous methods, QaNER is faster at inference, insensitive to the prompt quality, and robust to hyper-parameters, as well as demonstrating significantly better low-resource performance and zero-shot capability.  ( 2 min )
    Neural Galerkin Scheme with Active Learning for High-Dimensional Evolution Equations. (arXiv:2203.01360v1 [math.NA])
    Machine learning methods have been shown to give accurate predictions in high dimensions provided that sufficient training data are available. Yet, many interesting questions in science and engineering involve situations where initially no data are available and the principal aim is to gather insights from a known model. Here we consider this problem in the context of systems whose evolution can be described by partial differential equations (PDEs). We use deep learning to solve these equations by generating data on-the-fly when and where they are needed, without prior information about the solution. The proposed Neural Galerkin schemes derive nonlinear dynamical equations for the network weights by minimization of the residual of the time derivative of the solution, and solve these equations using standard integrators for initial value problems. The sequential learning of the weights over time allows for adaptive collection of new input data for residual estimation. This step uses importance sampling informed by the current state of the solution, in contrast with other machine learning methods for PDEs that optimize the network parameters globally in time. This active form of data acquisition is essential to enable the approximation power of the neural networks and to break the curse of dimensionality faced by non-adaptative learning strategies. The applicability of the method is illustrated on several numerical examples involving high-dimensional PDEs, including advection equations with many variables, as well as Fokker-Planck equations for systems with several interacting particles.  ( 2 min )
    Physics-informed neural network solution of thermo-hydro-mechanical (THM) processes in porous media. (arXiv:2203.01514v1 [cs.LG])
    Physics-Informed Neural Networks (PINNs) have received increased interest for forward, inverse, and surrogate modeling of problems described by partial differential equations (PDE). However, their application to multiphysics problem, governed by several coupled PDEs, present unique challenges that have hindered the robustness and widespread applicability of this approach. Here we investigate the application of PINNs to the forward solution of problems involving thermo-hydro-mechanical (THM) processes in porous media, which exhibit disparate spatial and temporal scales in thermal conductivity, hydraulic permeability, and elasticity. In addition, PINNs are faced with the challenges of the multi-objective and non-convex nature of the optimization problem. To address these fundamental issues, we: (1)~rewrite the THM governing equations in dimensionless form that is best suited for deep-learning algorithms; (2)~propose a sequential training strategy that circumvents the need for a simultaneous solution of the multiphysics problem and facilitates the task of optimizers in the solution search; and (3)~leverage adaptive weight strategies to overcome the stiffness in the gradient flow of the multi-objective optimization problem. Finally, we apply this framework to the solution of several synthetic problems in 1D and~2D.  ( 2 min )
    Detecting Chronic Kidney Disease(CKD) at the Initial Stage: A Novel Hybrid Feature-selection Method and Robust Data Preparation Pipeline for Different ML Techniques. (arXiv:2203.01394v1 [cs.LG])
    Chronic Kidney Disease (CKD) has infected almost 800 million people around the world. Around 1.7 million people die each year because of it. Detecting CKD in the initial stage is essential for saving millions of lives. Many researchers have applied distinct Machine Learning (ML) methods to detect CKD at an early stage, but detailed studies are still missing. We present a structured and thorough method for dealing with the complexities of medical data with optimal performance. Besides, this study will assist researchers in producing clear ideas on the medical data preparation pipeline. In this paper, we applied KNN Imputation to impute missing values, Local Outlier Factor to remove outliers, SMOTE to handle data imbalance, K-stratified K-fold Cross-validation to validate the ML models, and a novel hybrid feature selection method to remove redundant features. Applied algorithms in this study are Support Vector Machine, Gaussian Naive Bayes, Decision Tree, Random Forest, Logistic Regression, K-Nearest Neighbor, Gradient Boosting, Adaptive Boosting, and Extreme Gradient Boosting. Finally, the Random Forest can detect CKD with 100% accuracy without any data leakage.  ( 2 min )
    A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems. (arXiv:2203.01387v1 [cs.LG])
    With the widespread adoption of deep learning, reinforcement learning (RL) has experienced a dramatic increase in popularity, scaling to previously intractable problems, such as playing complex games from pixel observations, sustaining conversations with humans, and controlling robotic agents. However, there is still a wide range of domains inaccessible to RL due to the high cost and danger of interacting with the environment. Offline RL is a paradigm that learns exclusively from static datasets of previously collected interactions, making it feasible to extract policies from large and diverse training datasets. Effective offline RL algorithms have a much wider range of applications than online RL, being particularly appealing for real-world applications such as education, healthcare, and robotics. In this work, we propose a unifying taxonomy to classify offline RL methods. Furthermore, we provide a comprehensive review of the latest algorithmic breakthroughs in the field, and a review of existing benchmarks' properties and shortcomings. Finally, we provide our perspective on open problems and propose future research directions for this rapidly growing field.  ( 2 min )
    Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming. (arXiv:2203.01382v1 [cs.LG])
    Weak Supervision (WS) techniques allow users to efficiently create large training datasets by programmatically labeling data with heuristic sources of supervision. While the success of WS relies heavily on the provided labeling heuristics, the process of how these heuristics are created in practice has remained under-explored. In this work, we formalize the development process of labeling heuristics as an interactive procedure, built around the existing workflow where users draw ideas from a selected set of development data for designing the heuristic sources. With the formalism, we study two core problems of how to strategically select the development data to guide users in efficiently creating informative heuristics, and how to exploit the information within the development process to contextualize and better learn from the resultant heuristics. Building upon two novel methodologies that effectively tackle the respective problems considered, we present Nemo, an end-to-end interactive system that improves the overall productivity of WS learning pipeline by an average 20% (and up to 47% in one task) compared to the prevailing WS approach.  ( 2 min )
    Continuous-Time Meta-Learning with Forward Mode Differentiation. (arXiv:2203.01443v1 [cs.LG])
    Drawing inspiration from gradient-based meta-learning methods with infinitely small gradient steps, we introduce Continuous-Time Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field. Specifically, representations of the inputs are meta-learned such that a task-specific linear classifier is obtained as a solution of an ordinary differential equation (ODE). Treating the learning process as an ODE offers the notable advantage that the length of the trajectory is now continuous, as opposed to a fixed and discrete number of gradient steps. As a consequence, we can optimize the amount of adaptation necessary to solve a new task using stochastic gradient descent, in addition to learning the initial conditions as is standard practice in gradient-based meta-learning. Importantly, in order to compute the exact meta-gradients required for the outer-loop updates, we devise an efficient algorithm based on forward mode differentiation, whose memory requirements do not scale with the length of the learning trajectory, thus allowing longer adaptation in constant memory. We provide analytical guarantees for the stability of COMLN, we show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.  ( 2 min )
    Learning Stochastic Parametric Differentiable Predictive Control Policies. (arXiv:2203.01447v1 [cs.LG])
    The problem of synthesizing stochastic explicit model predictive control policies is known to be quickly intractable even for systems of modest complexity when using classical control-theoretic methods. To address this challenge, we present a scalable alternative called stochastic parametric differentiable predictive control (SP-DPC) for unsupervised learning of neural control policies governing stochastic linear systems subject to nonlinear chance constraints. SP-DPC is formulated as a deterministic approximation to the stochastic parametric constrained optimal control problem. This formulation allows us to directly compute the policy gradients via automatic differentiation of the problem's value function, evaluated over sampled parameters and uncertainties. In particular, the computed expectation of the SP-DPC problem's value function is backpropagated through the closed-loop system rollouts parametrized by a known nominal system dynamics model and neural control policy which allows for direct model-based policy optimization. We provide theoretical probabilistic guarantees for policies learned via the SP-DPC method on closed-loop stability and chance constraints satisfaction. Furthermore, we demonstrate the computational efficiency and scalability of the proposed policy optimization algorithm in three numerical examples, including systems with a large number of states or subject to nonlinear constraints.  ( 2 min )
    iMVS: Improving MVS Networks by Learning Depth Discontinuities. (arXiv:2203.01391v1 [cs.CV])
    Existing learning-based multi-view stereo (MVS) techniques are effective in terms of completeness in reconstruction. We further improve these techniques by learning depth continuities. Our idea is to jointly estimate the depth and boundary maps. To this end, we introduce learning-based MVS strategies to improve the quality of depth maps via mixture density and depth discontinuity learning. We validate our idea and demonstrate that our strategies can be easily integrated into existing learning-based MVS pipelines where the reconstruction depends on high-quality depth map estimation. We also introduce a bimodal depth representation and a novel spatial regularization approach to the MVS networks. Extensive experiments on various datasets show that our method sets a new state of the art in terms of completeness and overall reconstruction quality. Experiments also demonstrate that the presented model and strategies have good generalization capabilities. The source code will be available soon.  ( 2 min )
    Successful Recovery of an Observed Meteorite Fall Using Drones and Machine Learning. (arXiv:2203.01466v1 [astro-ph.EP])
    We report the first-time recovery of a fresh meteorite fall using a drone and a machine learning algorithm. A fireball on the 1st April 2021 was observed over Western Australia by the Desert Fireball Network, for which a fall area was calculated for the predicted surviving mass. A search team arrived on site and surveyed 5.1 km2 area over a 4-day period. A convolutional neural network, trained on previously-recovered meteorites with fusion crusts, processed the images on our field computer after each flight. meteorite candidates identified by the algorithm were sorted by team members using two user interfaces to eliminate false positives. Surviving candidates were revisited with a smaller drone, and imaged in higher resolution, before being eliminated or finally being visited in-person. The 70 g meteorite was recovered within 50 m of the calculated fall line using, demonstrating the effectiveness of this methodology which will facilitate the efficient collection of many more observed meteorite falls.  ( 2 min )
    Precise Stock Price Prediction for Optimized Portfolio Design Using an LSTM Model. (arXiv:2203.01326v1 [q-fin.PM])
    Accurate prediction of future prices of stocks is a difficult task to perform. Even more challenging is to design an optimized portfolio of stocks with the identification of proper weights of allocation to achieve the optimized values of return and risk. We present optimized portfolios based on the seven sectors of the Indian economy. The past prices of the stocks are extracted from the web from January 1, 2016, to December 31, 2020. Optimum portfolios are designed on the selected seven sectors. An LSTM regression model is also designed for predicting future stock prices. Five months after the construction of the portfolios, i.e., on June 1, 2021, the actual and predicted returns and risks of each portfolio are computed. The predicted and the actual returns indicate the very high accuracy of the LSTM model.  ( 2 min )
    Deep Q-network using reservoir computing with multi-layered readout. (arXiv:2203.01465v1 [cs.LG])
    Recurrent neural network (RNN) based reinforcement learning (RL) is used for learning context-dependent tasks and has also attracted attention as a method with remarkable learning performance in recent research. However, RNN-based RL has some issues that the learning procedures tend to be more computationally expensive, and training with backpropagation through time (BPTT) is unstable because of vanishing/exploding gradients problem. An approach with replay memory introducing reservoir computing has been proposed, which trains an agent without BPTT and avoids these issues. The basic idea of this approach is that observations from the environment are input to the reservoir network, and both the observation and the reservoir output are stored in the memory. This paper shows that the performance of this method improves by using a multi-layered neural network for the readout layer, which regularly consists of a single linear layer. The experimental results show that using multi-layered readout improves the learning performance of four classical control tasks that require time-series processing.  ( 2 min )
    Rethinking the role of normalization and residual blocks for spiking neural networks. (arXiv:2203.01544v1 [cs.NE])
    Biologically inspired spiking neural networks (SNNs) are widely used to realize ultralow-power energy consumption. However, deep SNNs are not easy to train due to the excessive firing of spiking neurons in the hidden layers. To tackle this problem, we propose a novel but simple normalization technique called postsynaptic potential normalization. This normalization removes the subtraction term from the standard normalization and uses the second raw moment instead of the variance as the division term. The spike firing can be controlled, enabling the training to proceed appropriating, by conducting this simple normalization to the postsynaptic potential. The experimental results show that SNNs with our normalization outperformed other models using other normalizations. Furthermore, through the pre-activation residual blocks, the proposed model can train with more than 100 layers without other special techniques dedicated to SNNs.  ( 2 min )
    Near-Optimal Correlation Clustering with Privacy. (arXiv:2203.01440v1 [cs.LG])
    Correlation clustering is a central problem in unsupervised learning, with applications spanning community detection, duplicate detection, automated labelling and many more. In the correlation clustering problem one receives as input a set of nodes and for each node a list of co-clustering preferences, and the goal is to output a clustering that minimizes the disagreement with the specified nodes' preferences. In this paper, we introduce a simple and computationally efficient algorithm for the correlation clustering problem with provable privacy guarantees. Our approximation guarantees are stronger than those shown in prior work and are optimal up to logarithmic factors.  ( 2 min )
    LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network using Transformers for Cross-Modal Information Retrieval in Histopathology Archives. (arXiv:2203.01445v1 [cs.CV])
    The volume of available data has grown dramatically in recent years in many applications. Furthermore, the age of networks that used multiple modalities separately has practically ended. Therefore, enabling bidirectional cross-modality data retrieval capable of processing has become a requirement for many domains and disciplines of research. This is especially true in the medical field, as data comes in a multitude of types, including various types of images and reports as well as molecular data. Most contemporary works apply cross attention to highlight the essential elements of an image or text in relation to the other modalities and try to match them together. However, regardless of their importance in their own modality, these approaches usually consider features of each modality equally. In this study, self-attention as an additional loss term will be proposed to enrich the internal representation provided into the cross attention module. This work suggests a novel architecture with a new loss term to help represent images and texts in the joint latent space. Experiment results on two benchmark datasets, i.e. MS-COCO and ARCH, show the effectiveness of the proposed method.  ( 2 min )
    Self-supervised Transparent Liquid Segmentation for Robotic Pouring. (arXiv:2203.01538v1 [cs.RO])
    Liquid state estimation is important for robotics tasks such as pouring; however, estimating the state of transparent liquids is a challenging problem. We propose a novel segmentation pipeline that can segment transparent liquids such as water from a static, RGB image without requiring any manual annotations or heating of the liquid for training. Instead, we use a generative model that is capable of translating images of colored liquids into synthetically generated transparent liquid images, trained only on an unpaired dataset of colored and transparent liquid images. Segmentation labels of colored liquids are obtained automatically using background subtraction. Our experiments show that we are able to accurately predict a segmentation mask for transparent liquids without requiring any manual annotations. We demonstrate the utility of transparent liquid segmentation in a robotic pouring task that controls pouring by perceiving the liquid height in a transparent cup. Accompanying video and supplementary materials can be found  ( 2 min )
    Weightless Neural Networks for Efficient Edge Inference. (arXiv:2203.01479v1 [cs.AR])
    Weightless Neural Networks (WNNs) are a class of machine learning model which use table lookups to perform inference. This is in contrast with Deep Neural Networks (DNNs), which use multiply-accumulate operations. State-of-the-art WNN architectures have a fraction of the implementation cost of DNNs, but still lag behind them on accuracy for common image recognition tasks. Additionally, many existing WNN architectures suffer from high memory requirements. In this paper, we propose a novel WNN architecture, BTHOWeN, with key algorithmic and architectural improvements over prior work, namely counting Bloom filters, hardware-friendly hashing, and Gaussian-based nonlinear thermometer encodings to improve model accuracy and reduce area and energy consumption. BTHOWeN targets the large and growing edge computing sector by providing superior latency and energy efficiency to comparable quantized DNNs. Compared to state-of-the-art WNNs across nine classification datasets, BTHOWeN on average reduces error by more than than 40% and model size by more than 50%. We then demonstrate the viability of the BTHOWeN architecture by presenting an FPGA-based accelerator, and compare its latency and resource usage against similarly accurate quantized DNN accelerators, including Multi-Layer Perceptron (MLP) and convolutional models. The proposed BTHOWeN models consume almost 80% less energy than the MLP models, with nearly 85% reduction in latency. In our quest for efficient ML on the edge, WNNs are clearly deserving of additional attention.  ( 2 min )
  • Open

    Optimal Adaptive Matrix Completion. (arXiv:2002.02431v2 [cs.LG] UPDATED)
    We study the problem of exact completion for $m \times n$ sized matrix of rank r with the adaptive sampling method. We introduce a relation of the exact completion problem with the sparsest vector of column and row spaces (which we call sparsity-number here). Using this relation, we propose matrix completion algorithms that exactly recovers the target matrix. These algorithms are superior to previous works in two important ways. First, our algorithms exactly recover $\mu_0$-coherent column space matrices by probability at least $1-\epsilon$ using much smaller observations complexity than - $\mathcal{O}(\mu_0 rn \mathrm{log}\frac{r}{\epsilon})$ - the state of art. Specifically, many of the previous adaptive sampling methods require to observe the entire matrix when the column space is highly coherent. However, we show that our method is still able to recover this type of matrices by observing a small fraction of entries under many scenarios. Second, we propose an exact completion algorithm, which requires minimal pre-information as either row or column space is not being highly coherent. We provide an extension of these algorithms that is robust to sparse random noise. Besides, we propose an additional low-rank estimation algorithm that is robust to any small noise by adaptively studying the shape of column space. At the end of the paper, we provide experimental results that illustrate the strength of the algorithms proposed here.
    Joint Probability Estimation Using Tensor Decomposition and Dictionaries. (arXiv:2203.01667v1 [cs.LG])
    In this work, we study non-parametric estimation of joint probabilities of a given set of discrete and continuous random variables from their (empirically estimated) 2D marginals, under the assumption that the joint probability could be decomposed and approximated by a mixture of product densities/mass functions. The problem of estimating the joint probability density function (PDF) using semi-parametric techniques such as Gaussian Mixture Models (GMMs) is widely studied. However such techniques yield poor results when the underlying densities are mixtures of various other families of distributions such as Laplacian or generalized Gaussian, uniform, Cauchy, etc. Further, GMMs are not the best choice to estimate joint distributions which are hybrid in nature, i.e., some random variables are discrete while others are continuous. We present a novel approach for estimating the PDF using ideas from dictionary representations in signal processing coupled with low rank tensor decompositions. To the best our knowledge, this is the first work on estimating joint PDFs employing dictionaries alongside tensor decompositions. We create a dictionary of various families of distributions by inspecting the data, and use it to approximate each decomposed factor of the product in the mixture. Our approach can naturally handle hybrid $N$-dimensional distributions. We test our approach on a variety of synthetic and real datasets to demonstrate its effectiveness in terms of better classification rates and lower error rates, when compared to state of the art estimators.
    Continuous Relaxation For The Multivariate Non-Central Hypergeometric Distribution. (arXiv:2203.01629v1 [cs.LG])
    Partitioning a set of elements into a given number of groups of a priori unknown sizes is an important task in many applications. Due to hard constraints, it is a non-differentiable problem which prohibits its direct use in modern machine learning frameworks. Hence, previous works mostly fall back on suboptimal heuristics or simplified assumptions. The multivariate hypergeometric distribution offers a probabilistic formulation of how to distribute a given number of samples across multiple groups. Unfortunately, as a discrete probability distribution, it neither is differentiable. In this work, we propose a continuous relaxation for the multivariate non-central hypergeometric distribution. We introduce an efficient and numerically stable sampling procedure. This enables reparameterized gradients for the hypergeometric distribution and its integration into automatic differentiation frameworks. We highlight the applicability and usability of the proposed formulation on two different common machine learning tasks.
    Bayesian Spillover Graphs for Dynamic Networks. (arXiv:2203.01912v1 [stat.ME])
    We present Bayesian Spillover Graphs (BSG), a novel method for learning temporal relationships, identifying critical nodes, and quantifying uncertainty for multi-horizon spillover effects in a dynamic system. BSG leverages both an interpretable framework via forecast error variance decompositions (FEVD) and comprehensive uncertainty quantification via Bayesian time series models to contextualize temporal relationships in terms of systemic risk and prediction variability. Forecast horizon hyperparameter $h$ allows for learning both short-term and equilibrium state network behaviors. Experiments for identifying source and sink nodes under various graph and error specifications show significant performance gains against state-of-the-art Bayesian Networks and deep-learning baselines. Applications to real-world systems also showcase BSG as an exploratory analysis tool for uncovering indirect spillovers and quantifying risk.
    Sparse Bayesian Optimization. (arXiv:2203.01900v1 [cs.LG])
    Bayesian optimization (BO) is a powerful approach to sample-efficient optimization of black-box objective functions. However, the application of BO to areas such as recommendation systems often requires taking the interpretability and simplicity of the configurations into consideration, a setting that has not been previously studied in the BO literature. To make BO applicable in this setting, we present several regularization-based approaches that allow us to discover sparse and more interpretable configurations. We propose a novel differentiable relaxation based on homotopy continuation that makes it possible to target sparsity by working directly with $L_0$ regularization. We identify failure modes for regularized BO and develop a hyperparameter-free method, sparsity exploring Bayesian optimization (SEBO) that seeks to simultaneously maximize a target objective and sparsity. SEBO and methods based on fixed regularization are evaluated on synthetic and real-world problems, and we show that we are able to efficiently optimize for sparsity.
    Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning. (arXiv:2001.00119v2 [cs.LG] UPDATED)
    Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on getting feedback via extrinsic rewards to train the agent, and in situations where this occurs very rarely the agent learns slowly or cannot learn at all. Similarly, if the agent receives also rewards that create suboptimal modes of the objective function, it will likely prematurely stop exploring. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment. Source code is available at https://github.com/sparisi/visit-value-explore
    Data Augmentation as Feature Manipulation: a story of desert cows and grass cows. (arXiv:2203.01572v1 [cs.LG])
    Data augmentation is a cornerstone of the machine learning pipeline, yet its theoretical underpinnings remain unclear. Is it merely a way to artificially augment the data set size? Or is it about encouraging the model to satisfy certain invariance? In this work we consider another angle, and we study the effect of data augmentation on the dynamic of the learning process. We find that data augmentation can alter the relative importance of various features, effectively making certain informative but hard to learn features more likely to be captured in the learning process. Importantly, we show that this effect is more pronounced for non-linear models, such as neural networks. Our main contribution is a detailed analysis of data augmentation on the learning dynamic for a two layer convolutional neural network in the recently proposed multi-view model by Allen-Zhu and Li [2020]. We complement this analysis with further experimental evidence that data augmentation can be viewed as a form of feature manipulation.
    Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings. (arXiv:2203.01570v1 [cs.LG])
    A topic model is often formulated as a generative model that explains how each word of a document is generated given a set of topics and document-specific topic proportions. It is focused on capturing the word co-occurrences in a document and hence often suffers from poor performance in analyzing short documents. In addition, its parameter estimation often relies on approximate posterior inference that is either not scalable or suffers from large approximation error. This paper introduces a new topic-modeling framework where each document is viewed as a set of word embedding vectors and each topic is modeled as an embedding vector in the same embedding space. Embedding the words and topics in the same vector space, we define a method to measure the semantic difference between the embedding vectors of the words of a document and these of the topics, and optimize the topic embeddings to minimize the expected difference over all documents. Experiments on text analysis demonstrate that the proposed method, which is amenable to mini-batch stochastic gradient descent based optimization and hence scalable to big corpora, provides competitive performance in discovering more coherent and diverse topics and extracting better document representations.
    Estimating Conditional Average Treatment Effects with Missing Treatment Information. (arXiv:2203.01422v1 [stat.ML])
    Estimating conditional average treatment effects (CATE) is challenging, especially when treatment information is missing. Although this is a widespread problem in practice, CATE estimation with missing treatments has received little attention. In this paper, we analyze CATE estimation in the setting with missing treatments where, thus, unique challenges arise in the form of covariate shifts. We identify two covariate shifts in our setting: (i) a covariate shift between the treated and control population; and (ii) a covariate shift between the observed and missing treatment population. We first theoretically show the effect of these covariate shifts by deriving a generalization bound for estimating CATE in our setting with missing treatments. Then, motivated by our bound, we develop the missing treatment representation network (MTRNet), a novel CATE estimation algorithm that learns a balanced representation of covariates using domain adaptation. By using balanced representations, MTRNet provides more reliable CATE estimates in the covariate domains where the data are not fully observed. In various experiments with semi-synthetic and real-world data, we show that our algorithm improves over the state-of-the-art by a substantial margin.
    Early Time-Series Classification Algorithms: An Empirical Comparison. (arXiv:2203.01628v1 [cs.LG])
    Early Time-Series Classification (ETSC) is the task of predicting the class of incoming time-series by observing as few measurements as possible. Such methods can be employed to obtain classification forecasts in many time-critical applications. However, available techniques are not equally suitable for every problem, since differentiations in the data characteristics can impact algorithm performance in terms of earliness, accuracy, F1-score, and training time. We evaluate six existing ETSC algorithms on publicly available data, as well as on two newly introduced datasets originating from the life sciences and maritime domains. Our goal is to provide a framework for the evaluation and comparison of ETSC algorithms and to obtain intuition on how such approaches perform on real-life applications. The presented framework may also serve as a benchmark for new related techniques.
    Reinforcement Learning in Possibly Nonstationary Environments. (arXiv:2203.01707v1 [stat.ML])
    We consider reinforcement learning (RL) methods in offline nonstationary environments. Many existing RL algorithms in the literature rely on the stationarity assumption that requires the system transition and the reward function to be constant over time. However, the stationarity assumption is restrictive in practice and is likely to be violated in a number of applications, including traffic signal control, robotics and mobile health. In this paper, we develop a consistent procedure to test the nonstationarity of the optimal policy based on pre-collected historical data, without additional online data collection. Based on the proposed test, we further develop a sequential change point detection method that can be naturally coupled with existing state-of-the-art RL methods for policy optimisation in nonstationary environments. The usefulness of our method is illustrated by theoretical results, simulation studies, and a real data example from the 2018 Intern Health Study. A Python implementation of the proposed procedure is available at https://github.com/limengbinggz/CUSUM-RL
    Capturing Shape Information with Multi-Scale Topological Loss Terms for 3D Reconstruction. (arXiv:2203.01703v1 [cs.CV])
    Reconstructing 3D objects from 2D images is both challenging for our brains and machine learning algorithms. To support this spatial reasoning task, contextual information about the overall shape of an object is critical. However, such information is not captured by established loss terms (e.g. Dice loss). We propose to complement geometrical shape information by including multi-scale topological features, such as connected components, cycles, and voids, in the reconstruction loss. Our method calculates topological features from 3D volumetric data based on cubical complexes and uses an optimal transport distance to guide the reconstruction process. This topology-aware loss is fully differentiable, computationally efficient, and can be added to any neural network. We demonstrate the utility of our loss by incorporating it into SHAPR, a model for predicting the 3D cell shape of individual cells based on 2D microscopy images. Using a hybrid loss that leverages both geometrical and topological information of single objects to assess their shape, we find that topological information substantially improves the quality of reconstructions, thus highlighting its ability to extract more relevant features from image datasets.
    Accelerated SGD for Non-Strongly-Convex Least Squares. (arXiv:2203.01744v1 [cs.LG])
    We consider stochastic approximation for the least squares regression problem in the non-strongly convex setting. We present the first practical algorithm that achieves the optimal prediction error rates in terms of dependence on the noise of the problem, as $O(d/t)$ while accelerating the forgetting of the initial conditions to $O(d/t^2)$. Our new algorithm is based on a simple modification of the accelerated gradient descent. We provide convergence results for both the averaged and the last iterate of the algorithm. In order to describe the tightness of these new bounds, we present a matching lower bound in the noiseless setting and thus show the optimality of our algorithm.
    Local Constraint-Based Causal Discovery under Selection Bias. (arXiv:2203.01848v1 [cs.LG])
    We consider the problem of discovering causal relations from independence constraints selection bias in addition to confounding is present. While the seminal FCI algorithm is sound and complete in this setup, no criterion for the causal interpretation of its output under selection bias is presently known. We focus instead on local patterns of independence relations, where we find no sound method for only three variable that can include background knowledge. Y-Structure patterns are shown to be sound in predicting causal relations from data under selection bias, where cycles may be present. We introduce a finite-sample scoring rule for Y-Structures that is shown to successfully predict causal relations in simulation experiments that include selection mechanisms. On real-world microarray data, we show that a Y-Structure variant performs well across different datasets, potentially circumventing spurious correlations due to selection bias.
    Scalable Bayesian Optimization Using Vecchia Approximations of Gaussian Processes. (arXiv:2203.01459v1 [cs.LG])
    Bayesian optimization is a technique for optimizing black-box target functions. At the core of Bayesian optimization is a surrogate model that predicts the output of the target function at previously unseen inputs to facilitate the selection of promising input values. Gaussian processes (GPs) are commonly used as surrogate models but are known to scale poorly with the number of observations. We adapt the Vecchia approximation, a popular GP approximation from spatial statistics, to enable scalable high-dimensional Bayesian optimization. We develop several improvements and extensions, including training warped GPs using mini-batch gradient descent, approximate neighbor search, and selecting multiple input values in parallel. We focus on the use of our warped Vecchia GP in trust-region Bayesian optimization via Thompson sampling. On several test functions and on two reinforcement-learning problems, our methods compared favorably to the state of the art.
    A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems. (arXiv:2203.01387v1 [cs.LG])
    With the widespread adoption of deep learning, reinforcement learning (RL) has experienced a dramatic increase in popularity, scaling to previously intractable problems, such as playing complex games from pixel observations, sustaining conversations with humans, and controlling robotic agents. However, there is still a wide range of domains inaccessible to RL due to the high cost and danger of interacting with the environment. Offline RL is a paradigm that learns exclusively from static datasets of previously collected interactions, making it feasible to extract policies from large and diverse training datasets. Effective offline RL algorithms have a much wider range of applications than online RL, being particularly appealing for real-world applications such as education, healthcare, and robotics. In this work, we propose a unifying taxonomy to classify offline RL methods. Furthermore, we provide a comprehensive review of the latest algorithmic breakthroughs in the field, and a review of existing benchmarks' properties and shortcomings. Finally, we provide our perspective on open problems and propose future research directions for this rapidly growing field.
    Fail-Safe Generative Adversarial Imitation Learning. (arXiv:2203.01696v1 [cs.LG])
    For flexible yet safe imitation learning (IL), we propose a modular approach that uses a generative imitator policy with a safety layer, has an overall explicit density/gradient, can therefore be end-to-end trained using generative adversarial IL (GAIL), and comes with theoretical worst-case safety/robustness guarantees. The safety layer's exact density comes from using a countable non-injective gluing of piecewise differentiable injections and the change-of-variables formula. The safe set (into which the safety layer maps) is inferred by sampling actions and their potential future fail-safe fallback continuations, together with Lipschitz continuity and convexity arguments. We also provide theoretical bounds showing the advantage of using the safety layer already during training (imitation error linear in the horizon) compared to only using it at test time (quadratic error). In an experiment on challenging real-world driver interaction data, we empirically demonstrate tractability, safety and imitation performance of our approach.
    Kernel Density Estimation by Genetic Algorithm. (arXiv:2203.01535v1 [stat.ME])
    This study proposes a data condensation method for multivariate kernel density estimation by genetic algorithm. First, our proposed algorithm generates multiple subsamples of a given size with replacement from the original sample. The subsamples and their constituting data points are regarded as $\it{chromosome}$ and $\it{gene}$, respectively, in the terminology of genetic algorithm. Second, each pair of subsamples breeds two new subsamples, where each data point faces either $\it{crossover}$, $\it{mutation}$, or $\it{reproduction}$ with a certain probability. The dominant subsamples in terms of fitness values are inherited by the next generation. This process is repeated generation by generation and brings the sparse representation of kernel density estimator in its completion. We confirmed from simulation studies that the resulting estimator can perform better than other well-known density estimators.
    On Practical Reinforcement Learning: Provable Robustness, Scalability, and Statistical Efficiency. (arXiv:2203.01758v1 [cs.LG])
    This thesis rigorously studies fundamental reinforcement learning (RL) methods in modern practical considerations, including robust RL, distributional RL, and offline RL with neural function approximation. The thesis first prepares the readers with an overall overview of RL and key technical background in statistics and optimization. In each of the settings, the thesis motivates the problems to be studied, reviews the current literature, provides computationally efficient algorithms with provable efficiency guarantees, and concludes with future research directions. The thesis makes fundamental contributions to the three settings above, both algorithmically, theoretically, and empirically, while staying relevant to practical considerations.
    Large-scale Optimization of Partial AUC in a Range of False Positive Rates. (arXiv:2203.01505v1 [cs.LG])
    The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning. However, it summarizes the true positive rates (TPRs) over all false positive rates (FPRs) in the ROC space, which may include the FPRs with no practical relevance in some applications. The partial AUC, as a generalization of the AUC, summarizes only the TPRs over a specific range of the FPRs and is thus a more suitable performance measure in many real-world situations. Although partial AUC optimization in a range of FPRs had been studied, existing algorithms are not scalable to big data and not applicable to deep learning. To address this challenge, we cast the problem into a non-smooth difference-of-convex (DC) program for any smooth predictive functions (e.g., deep neural networks), which allowed us to develop an efficient approximated gradient descent method based on the Moreau envelope smoothing technique, inspired by recent advances in non-smooth DC optimization. To increase the efficiency of large data processing, we used an efficient stochastic block coordinate update in our algorithm. Our proposed algorithm can also be used to minimize the sum of ranked range loss, which also lacks efficient solvers. We established a complexity of $\tilde O(1/\epsilon^6)$ for finding a nearly $\epsilon$-critical solution. Finally, we numerically demonstrated the effectiveness of our proposed algorithms for both partial AUC maximization and sum of ranked range loss minimization.
    A new harmonium for pattern recognition in survival data. (arXiv:2110.01960v2 [cs.LG] UPDATED)
    Survival analysis concerns the study of timeline data where the event of interest may remain unobserved (i.e., censored). Studies commonly record more than one type of event, but conventional survival techniques focus on a single event type. We set out to integrate both multiple independently censored time-to-event variables as well as missing observations. An energy-based approach is taken with a bi-partite structure between latent and visible states, commonly known as harmoniums (or restricted Boltzmann machines). The present harmonium is shown, both theoretically and experimentally, to capture non-linear patterns between distinct time recordings. We illustrate on real world data that, for a single time-to-event variable, our model is on par with established methods. In addition, we demonstrate that discriminative predictions improve by leveraging an extra time-to-event variable. In conclusion, multiple time-to-event variables can be successfully captured within the harmonium paradigm.
    T-Cal: An optimal test for the calibration of predictive models. (arXiv:2203.01850v1 [stat.ML])
    The prediction accuracy of machine learning methods is steadily increasing, but the calibration of their uncertainty predictions poses a significant challenge. Numerous works focus on obtaining well-calibrated predictive models, but less is known about reliably assessing model calibration. This limits our ability to know when algorithms for improving calibration have a real effect, and when their improvements are merely artifacts due to random noise in finite datasets. In this work, we consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem. The null hypothesis is that the predictive model is calibrated, while the alternative hypothesis is that the deviation from calibration is sufficiently large. We find that detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. When the conditional class probabilities are H\"older continuous, we propose T-Cal, a minimax optimal test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE). We further propose Adaptive T-Cal, a version that is adaptive to unknown smoothness. We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. T-Cal is a practical general-purpose tool, which -- combined with classical tests for discrete-valued predictors -- can be used to test the calibration of virtually any probabilistic classification method.
    The Role of Pretrained Representations for the OOD Generalization of RL Agents. (arXiv:2107.05686v3 [cs.LG] UPDATED)
    Building sample-efficient agents that generalize out-of-distribution (OOD) in real-world settings remains a fundamental unsolved problem on the path towards achieving higher-level cognition. One particularly promising approach is to begin with low-dimensional, pretrained representations of our world, which should facilitate efficient downstream learning and generalization. By training 240 representations and over 10,000 reinforcement learning (RL) policies on a simulated robotic setup, we evaluate to what extent different properties of pretrained VAE-based representations affect the OOD generalization of downstream agents. We observe that many agents are surprisingly robust to realistic distribution shifts, including the challenging sim-to-real case. In addition, we find that the generalization performance of a simple downstream proxy task reliably predicts the generalization performance of our RL agents under a wide range of OOD settings. Such proxy tasks can thus be used to select pretrained representations that will lead to agents that generalize.
    Exploiting locality in high-dimensional factorial hidden Markov models. (arXiv:1902.01639v3 [stat.ML] UPDATED)
    We propose algorithms for approximate filtering and smoothing in high-dimensional Factorial hidden Markov models. The approximation involves discarding, in a principled way, likelihood factors according to a notion of locality in a factor graph associated with the emission distribution. This allows the exponential-in-dimension cost of exact filtering and smoothing to be avoided. We prove that the approximation accuracy, measured in a local total variation norm, is "dimension-free" in the sense that as the overall dimension of the model increases the error bounds we derive do not necessarily degrade. A key step in the analysis is to quantify the error introduced by localizing the likelihood function in a Bayes' rule update. The factorial structure of the likelihood function which we exploit arises naturally when data have known spatial or network structure. We demonstrate the new algorithms on synthetic examples and a London Underground passenger flow problem, where the factor graph is effectively given by the train network.
    Uniform Approximations for Randomized Hadamard Transforms with Applications. (arXiv:2203.01599v1 [cs.LG])
    Randomized Hadamard Transforms (RHTs) have emerged as a computationally efficient alternative to the use of dense unstructured random matrices across a range of domains in computer science and machine learning. For several applications such as dimensionality reduction and compressed sensing, the theoretical guarantees for methods based on RHTs are comparable to approaches using dense random matrices with i.i.d.\ entries. However, several such applications are in the low-dimensional regime where the number of rows sampled from the matrix is rather small. Prior arguments are not applicable to the high-dimensional regime often found in machine learning applications like kernel approximation. Given an ensemble of RHTs with Gaussian diagonals, $\{M^i\}_{i = 1}^m$, and any $1$-Lipschitz function, $f: \mathbb{R} \to \mathbb{R}$, we prove that the average of $f$ over the entries of $\{M^i v\}_{i = 1}^m$ converges to its expectation uniformly over $\| v \| \leq 1$ at a rate comparable to that obtained from using truly Gaussian matrices. We use our inequality to then derive improved guarantees for two applications in the high-dimensional regime: 1) kernel approximation and 2) distance estimation. For kernel approximation, we prove the first \emph{uniform} approximation guarantees for random features constructed through RHTs lending theoretical justification to their empirical success while for distance estimation, our convergence result implies data structures with improved runtime guarantees over previous work by the authors. We believe our general inequality is likely to find use in other applications.
    Understanding microbiome dynamics via interpretable graph representation learning. (arXiv:2203.01830v1 [q-bio.QM])
    Large-scale perturbations in the microbiome constitution are strongly correlated, whether as a driver or a consequence, with the health and functioning of human physiology. However, understanding the difference in the microbiome profiles of healthy and ill individuals can be complicated due to the large number of complex interactions among microbes. We propose to model these interactions as a time-evolving graph whose nodes are microbes and edges are interactions among them. Motivated by the need to analyse such complex interactions, we develop a method that learns a low-dimensional representation of the time-evolving graph and maintains the dynamics occurring in the high-dimensional space. Through our experiments, we show that we can extract graph features such as clusters of nodes or edges that have the highest impact on the model to learn the low-dimensional representation. This information can be crucial to identify microbes and interactions among them that are strongly correlated with clinical diseases. We conduct our experiments on both synthetic and real-world microbiome datasets.
    Identification in Tree-shaped Linear Structural Causal Models. (arXiv:2203.01852v1 [cs.AI])
    Linear structural equation models represent direct causal effects as directed edges and confounding factors as bidirected edges. An open problem is to identify the causal parameters from correlations between the nodes. We investigate models, whose directed component forms a tree, and show that there, besides classical instrumental variables, missing cycles of bidirected edges can be used to identify the model. They can yield systems of quadratic equations that we explicitly solve to obtain one or two solutions for the causal parameters of adjacent directed edges. We show how multiple missing cycles can be combined to obtain a unique solution. This results in an algorithm that can identify instances that previously required approaches based on Gr\"obner bases, which have doubly-exponential time complexity in the number of structural parameters.  ( 2 min )
    Providing reliability in Recommender Systems through Bernoulli Matrix Factorization. (arXiv:2006.03481v5 [cs.LG] UPDATED)
    Beyond accuracy, quality measures are gaining importance in modern recommender systems, with reliability being one of the most important indicators in the context of collaborative filtering. This paper proposes Bernoulli Matrix Factorization (BeMF), which is a matrix factorization model, to provide both prediction values and reliability values. BeMF is a very innovative approach from several perspectives: a) it acts on model-based collaborative filtering rather than on memory-based filtering, b) it does not use external methods or extended architectures, such as existing solutions, to provide reliability, c) it is based on a classification-based model instead of traditional regression-based models, and d) matrix factorization formalism is supported by the Bernoulli distribution to exploit the binary nature of the designed classification model. The experimental results show that the more reliable a prediction is, the less liable it is to be wrong: recommendation quality improves after the most reliable predictions are selected. State-of-the-art quality measures for reliability have been tested, which shows that BeMF outperforms previous baseline methods and models.  ( 2 min )
    Investigating the locality of neural network training dynamics. (arXiv:2111.01166v2 [cs.LG] UPDATED)
    In the recent past a certain property of neural training trajectories in weight-space had been isolated, that of "local elasticity" ($\srel$) - which attempts to quantify the propagation of influence of a sampled data point on the prediction at another data point. In this work, we embark on a comprehensive study of local elasticity. Firstly, specific to the classification setting, we suggest a new definition of the original idea of $\srel$. Via experiments on state-of-the-art neural networks training on SVHN, CIFAR-10 and CIFAR-100 we demonstrate how our new $\srel$ detects the property of the weight updates preferring to make changes in predictions within the same class as of the sampled data. Next, we demonstrate via examples of neural regression that the original $\srel$ reveals a $2-$phase behavior: that their training proceeds via an initial elastic phase when $\srel$ changes rapidly and an eventual inelastic phase when $\srel$ remains large. Lastly, we give multiple examples of learning via gradient flows for which one can get a closed-form expression of the original $\srel$ function. By studying the plots of these derived formulas we give theoretical demonstrations of some of the experimentally detected properties of $\srel$ in the regression setting.  ( 2 min )
    Approximate Bayesian Computation Based on Maxima Weighted Isolation Kernel Mapping. (arXiv:2201.12745v3 [stat.ML] UPDATED)
    Motivation: A branching processes model yields an unevenly stochastically distributed dataset that consists of sparse and dense regions. This work addresses the problem of precisely evaluating parameters for such a model. Applying a branching processes model to an area such as cancer cell evolution faces a number of obstacles, including high dimensionality and the rare appearance of a result of interest. We take on the ambitious task of obtaining the coefficients of a model that reflects the relationship of driver gene mutations and cancer hallmarks on the basis of personal data regarding variant allele frequencies. Results: An approximate Bayesian computation method based on Isolation Kernel is developed. The method involves the transformation of row data to a Hilbert space (mapping) and the measurement of the similarity between simulated points and maxima weighted Isolation Kernel mapping related to the observation point. We also design a heuristic algorithm for parameter estimation that requires no calculation and is dimension independent. The advantages of the proposed machine learning method are illustrated using multidimensional test data as well as a specific example focused on cancer cell evolution.  ( 2 min )
    CogDL: A Toolkit for Deep Learning on Graphs. (arXiv:2103.00959v3 [cs.SI] UPDATED)
    Deep learning on graphs has attracted tremendous attention from the graph learning community in recent years. It has been widely adopted in various real-world applications from diverse domains, such as social and information networks, biological graphs, and molecular graphs. In this paper, we present CogDL--an extensive toolkit for deep learning on graphs--that allows researchers and developers to easily conduct experiments and build applications. In CogDL, we propose a unified design for the training loop of graph neural network (GNN) models, making it unique among existing graph learning libraries. By utilizing this unified trainer, we can optimize the GNN training loop with several training techniques such as distributed training and mixed precision training. Moreover, we develop efficient sparse operators for CogDL, enabling it to become the most competitive graph library for efficiency. Additionally, another important CogDL feature is its focus on ease of use with the goal of facilitating open, robust, and reproducible graph learning research. We leverage CogDL to report and maintain benchmark results on the fundamental graph tasks such as node classification and graph classification, which can be reproduced and directly used by the community. Finally, we demonstrate the effectiveness and efficiency of CogDL for real-world applications in AMiner--a large-scale academic mining and search system. The CogDL toolkit is available at: https://github.com/thudm/cogdl.  ( 2 min )
    Terminal Embeddings in Sublinear Time. (arXiv:2110.08691v2 [cs.DS] UPDATED)
    Recently (Elkin, Filtser, Neiman 2017) introduced the concept of a {\it terminal embedding} from one metric space $(X,d_X)$ to another $(Y,d_Y)$ with a set of designated terminals $T\subset X$. Such an embedding $f$ is said to have distortion $\rho\ge 1$ if $\rho$ is the smallest value such that there exists a constant $C>0$ satisfying \begin{equation*} \forall x\in T\ \forall q\in X,\ C d_X(x, q) \le d_Y(f(x), f(q)) \le C \rho d_X(x, q) . \end{equation*} In the case that $X,Y$ are both Euclidean metrics with $Y$ being $m$-dimensional, recently (Narayanan, Nelson 2019), following work of (Mahabadi, Makarychev, Makarychev, Razenshteyn 2018), showed that distortion $1+\epsilon$ is achievable via such a terminal embedding with $m = O(\epsilon^{-2}\log n)$ for $n := |T|$. This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within $T$ and not to $T$ from the rest of space. The downside is that evaluating the embedding on some $q\in \mathbb{R}^d$ required solving a semidefinite program with $\Theta(n)$ constraints in $m$ variables and thus required some superlinear $\mathrm{poly}(n)$ runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process $T$ to obtain an almost linear-space data structure that supports computing the terminal embedding image of any $q\in\mathbb{R}^d$ in sublinear time $O^* (n^{1-\Theta(\epsilon^2)} + d)$. To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.  ( 2 min )
    Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. (arXiv:2112.07068v3 [stat.ML] UPDATED)
    Score-based generative models (SGMs) have demonstrated remarkable synthesis quality. SGMs rely on a diffusion process that gradually perturbs the data towards a tractable distribution, while the generative model learns to denoise. The complexity of this denoising task is, apart from the data distribution itself, uniquely determined by the diffusion process. We argue that current SGMs employ overly simplistic diffusions, leading to unnecessarily complex denoising processes, which limit generative modeling performance. Based on connections to statistical mechanics, we propose a novel critically-damped Langevin diffusion (CLD) and show that CLD-based SGMs achieve superior performance. CLD can be interpreted as running a joint diffusion in an extended space, where the auxiliary variables can be considered "velocities" that are coupled to the data variables as in Hamiltonian dynamics. We derive a novel score matching objective for CLD and show that the model only needs to learn the score function of the conditional distribution of the velocity given data, an easier task than learning scores of the data directly. We also derive a new sampling scheme for efficient synthesis from CLD-based diffusion models. We find that CLD outperforms previous SGMs in synthesis quality for similar network architectures and sampling compute budgets. We show that our novel sampler for CLD significantly outperforms solvers such as Euler--Maruyama. Our framework provides new insights into score-based denoising diffusion models and can be readily used for high-resolution image synthesis. Project page and code: https://nv-tlabs.github.io/CLD-SGM.  ( 2 min )
    On Robustness of Neural Ordinary Differential Equations. (arXiv:1910.05513v4 [cs.LG] UPDATED)
    Neural ordinary differential equations (ODEs) have been attracting increasing attention in various research domains recently. There have been some works studying optimization issues and approximation capabilities of neural ODEs, but their robustness is still yet unclear. In this work, we fill this important gap by exploring robustness properties of neural ODEs both empirically and theoretically. We first present an empirical study on the robustness of the neural ODE-based networks (ODENets) by exposing them to inputs with various types of perturbations and subsequently investigating the changes of the corresponding outputs. In contrast to conventional convolutional neural networks (CNNs), we find that the ODENets are more robust against both random Gaussian perturbations and adversarial attack examples. We then provide an insightful understanding of this phenomenon by exploiting a certain desirable property of the flow of a continuous-time ODE, namely that integral curves are non-intersecting. Our work suggests that, due to their intrinsic robustness, it is promising to use neural ODEs as a basic block for building robust deep network models. To further enhance the robustness of vanilla neural ODEs, we propose the time-invariant steady neural ODE (TisODE), which regularizes the flow on perturbed data via the time-invariant property and the imposition of a steady-state constraint. We show that the TisODE method outperforms vanilla neural ODEs and also can work in conjunction with other state-of-the-art architectural methods to build more robust deep networks.  ( 2 min )
    Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices. (arXiv:1903.08568v4 [cs.DS] UPDATED)
    We study the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution $\nu = e^{-f}$ on $\mathbb{R}^n$. We prove a convergence guarantee in Kullback-Leibler (KL) divergence assuming $\nu$ satisfies a log-Sobolev inequality and the Hessian of $f$ is bounded. Notably, we do not assume convexity or bounds on higher derivatives. We also prove convergence guarantees in R\'enyi divergence of order $q > 1$ assuming the limit of ULA satisfies either the log-Sobolev or Poincar\'e inequality. We also prove a bound on the bias of the limiting distribution of ULA assuming third-order smoothness of $f$, without requiring isoperimetry.  ( 2 min )
    Generalization Guarantees for Neural Architecture Search with Train-Validation Split. (arXiv:2104.14132v3 [stat.ML] UPDATED)
    Neural Architecture Search (NAS) is a popular method for automatically designing optimized architectures for high-performance deep learning. In this approach, it is common to use bilevel optimization where one optimizes the model weights over the training data (inner problem) and various hyperparameters such as the configuration of the architecture over the validation data (outer problem). This paper explores the statistical aspects of such problems with train-validation splits. In practice, the inner problem is often overparameterized and can easily achieve zero loss. Thus, a-priori it seems impossible to distinguish the right hyperparameters based on training loss alone which motivates a better understanding of the role of train-validation split. To this aim this work establishes the following results. (1) We show that refined properties of the validation loss such as risk and hyper-gradients are indicative of those of the true test loss. This reveals that the outer problem helps select the most generalizable model and prevent overfitting with a near-minimal validation sample size. This is established for continuous search spaces which are relevant for differentiable schemes. Extensions to transfer learning are developed in terms of the mismatch between training & validation distributions. (2) We establish generalization bounds for NAS problems with an emphasis on an activation search problem. When optimized with gradient-descent, we show that the train-validation procedure returns the best (model, architecture) pair even if all architectures can perfectly fit the training data to achieve zero error. (3) Finally, we highlight connections between NAS, multiple kernel learning, and low-rank matrix learning. The latter leads to novel insights where the solution of the outer problem can be accurately learned via efficient spectral methods to achieve near-minimal risk.  ( 2 min )
    Neural Galerkin Scheme with Active Learning for High-Dimensional Evolution Equations. (arXiv:2203.01360v1 [math.NA])
    Machine learning methods have been shown to give accurate predictions in high dimensions provided that sufficient training data are available. Yet, many interesting questions in science and engineering involve situations where initially no data are available and the principal aim is to gather insights from a known model. Here we consider this problem in the context of systems whose evolution can be described by partial differential equations (PDEs). We use deep learning to solve these equations by generating data on-the-fly when and where they are needed, without prior information about the solution. The proposed Neural Galerkin schemes derive nonlinear dynamical equations for the network weights by minimization of the residual of the time derivative of the solution, and solve these equations using standard integrators for initial value problems. The sequential learning of the weights over time allows for adaptive collection of new input data for residual estimation. This step uses importance sampling informed by the current state of the solution, in contrast with other machine learning methods for PDEs that optimize the network parameters globally in time. This active form of data acquisition is essential to enable the approximation power of the neural networks and to break the curse of dimensionality faced by non-adaptative learning strategies. The applicability of the method is illustrated on several numerical examples involving high-dimensional PDEs, including advection equations with many variables, as well as Fokker-Planck equations for systems with several interacting particles.  ( 2 min )
    Benign Overfitting in Multiclass Classification: All Roads Lead to Interpolation. (arXiv:2106.10865v2 [stat.ML] UPDATED)
    The literature on "benign overfitting" in overparameterized models has been mostly restricted to regression or binary classification; however, modern machine learning operates in the multiclass setting. Motivated by this discrepancy, we study benign overfitting in multiclass linear classification. Specifically, we consider the following training algorithms on separable data: (i) empirical risk minimization (ERM) with cross-entropy loss, which converges to the multiclass support vector machine (SVM) solution; (ii) ERM with least-squares loss, which converges to the min-norm interpolating (MNI) solution; and, (iii) the one-vs-all SVM classifier. First, we provide a simple sufficient deterministic condition under which all three algorithms lead to classifiers that interpolate the training data and have equal accuracy. When the data is generated from Gaussian mixtures or a multinomial logistic model, this condition holds under high enough effective overparameterization. We also show that this sufficient condition is satisfied under "neural collapse", a phenomenon that is observed in training deep neural networks. Second, we derive novel bounds on the accuracy of the MNI classifier, thereby showing that all three training algorithms lead to benign overfitting under sufficient overparameterization. Ultimately, our analysis shows that good generalization is possible for SVM solutions beyond the realm in which typical margin-based bounds apply.  ( 2 min )

  • Open

    A New Deep Learning Study Investigate and Clarify the Intrinsic Behavior of Transformers in Computer Vision
    In recent years, Transformers have overcome classic Convolutional Neural Networks (CNNs) and have rapidly become the state-of-the-art in many vision tasks. This power has always been explained as a consequence of the multi-head self-attention (MSA), and, in particular, its weak inductive bias and ability to capture long-range dependencies. But due to this over-flexibility, Vision Transformers (ViTs) have a tendency to (apparently) overfit training datasets, consequently leading to poor predictive performances in small data regimes. However, most of the concepts related to Transformer have been inferred empirically, and until now there has been no in-depth study of their inherent behavior. In this paper, the NAVER AI Lab and Yonsei University filled this lack and solve several doubts by investigating Vision Transformer in relation to three fundamental questions: 1) What properties of MSAs do we need to better optimize NNs? 2) Do MSAs behave like Convs (convolutional layers)? If not, how are they different? 3) How can MSAs be harmonized with Convs (convolutional layers)? Continue Reading My Summary of the paper Paper: https://arxiv.org/pdf/2202.06709v1.pdf Github: https://github.com/xxxnell/how-do-vits-work https://preview.redd.it/6jlvpwx9egl81.png?width=555&format=png&auto=webp&s=7cc3b6f846ad12c5e6c5cc41eaae9bef1442dcf7 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AI generated watercolor paintings
    https://www.youtube.com/watch?v=uMR1iR65Zqg https://www.youtube.com/watch?v=GLiFW4MlwHU https://www.youtube.com/watch?v=3JwkaUStups submitted by /u/Recent_Coffee_2551 [link] [comments]
    Trees video created using AI
    submitted by /u/Recent_Coffee_2551 [link] [comments]
    Looking to build a Team Interested in building AI technology to integrate to my cloud
    Please dm to hear more submitted by /u/RemoteNo9370 [link] [comments]
    Princeton University Researchers Introduce ‘DataMUX’: A Technique That Enables Deep Neural Networks to Process Multiple Inputs Simultaneously Using a Single Compact Representation
    A Princeton University research team proposed Data Multiplexing for Neural Networks in their new study DataMUX: Data Multiplexing for Neural Networks (DataMUX). This new method allows neural networks to analyze several inputs simultaneously and make correct predictions, improving model throughput while requiring minimal additional memory. Because of its unique capacity to represent complicated real-life issues, deep neural networks (DNNs) are a famous architecture in the machine learning field. Recent research suggests that such networks are grossly overparameterized, necessitating significant and possibly wasteful increases in model size and compute burden to construct modern DNNs. Continue Reading This Summary on This Research Here Paper: https://arxiv.org/pdf/2202.09318.pdf Github: https://github.com/princeton-nlp/datamux https://preview.redd.it/g37jkm5auel81.png?width=1288&format=png&auto=webp&s=5cba9ad2c080c70e6c46186fe041f7326ae4ccb5 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    How Do Chatbots Help in Improving your Brand Experience?
    submitted by /u/mihircontra20 [link] [comments]
    SINGULOS RESEARCH announces Perceptus platform for augmented reality — realtime object comprehension and tracking of physical objects on-device
    submitted by /u/SpatialComputing [link] [comments]  ( 2 min )
    Build Movies Recommendation Engine in Apache Spark
    submitted by /u/bigdataengineer4life [link] [comments]
    Researchers from Tel Aviv Propose Long-Text NLP Benchmark Called SCROLLS
    Although lengthier texts contain a significant quantity of natural language in the wild, NLP benchmarks have always primarily focused on short texts, such as sentences and paragraphs. Short text classification has consistently been a driving force behind standard benchmarks like GLUE, WMT, and SQuAD. A considerable amount of natural language is produced in the context of lengthier discourses, such as books, essays, and meeting transcripts, as is widely known. As a result, model structures are required to address the computing constraints associated with processing such long sequences. Researchers from Tel-Aviv University, Meta AI, IBM Research, and Allen Institute for AI (AI2) introduce Standardized CompaRison Over Long Language Sequences (SCROLLS) to tackle this issue. SCROLLS is a collection of summarization, question-answering, and natural language inference tasks that span a variety of topics, including literature, science, commerce, and entertainment. Continue Reading My Full Summary On This Paper Paper: https://arxiv.org/pdf/2201.03533.pdf https://preview.redd.it/8e5mbc3slal81.png?width=1920&format=png&auto=webp&s=e2a762237b0c6dd2fc8da561c091e64d0ebaf916 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Artificial Nightmares : Trypophobia Forrest || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
  • Open

    Decisions Part 1: Creating an AI-driven Decision Factory
    This is part 1 in a multi-part series on the value-realizing, collaborative power of decisions. My mom used to say, “If it was a snake, you’d be dead”. And the reason that I say that, is that organizations are seeking a collaborative value driver that can 1) align the organization around the economic power of… Read More »Decisions Part 1: Creating an AI-driven Decision Factory The post Decisions Part 1: Creating an AI-driven Decision Factory appeared first on Data Science Central.  ( 5 min )
  • Open

    “Rearranging” neural networks.
    Is it possible to take a neural network with for example 2 inputs and 1 output and rearrange it so that the inputs are O1 and I1 and I2 becomes the output. submitted by /u/Soggy-Statistician88 [link] [comments]  ( 1 min )
    Has anyone used Spiking Neural Networks (SNNs) for image processing?
    As the title says... has anyone used Spiking Neural Networks (SNNs) for image processing? More specifically Medical image processing? There seems to be very little on SNNs being used in image processing. I would appreciate input from anyone who has experience with this or if anyone knows of any good Github repositories for this. submitted by /u/bangwhosnext [link] [comments]  ( 1 min )
    Princeton University Researchers Introduce ‘DataMUX’: A Technique That Enables Deep Neural Networks to Process Multiple Inputs Simultaneously Using a Single Compact Representation
    A Princeton University research team proposed Data Multiplexing for Neural Networks in their new study DataMUX: Data Multiplexing for Neural Networks (DataMUX). This new method allows neural networks to analyze several inputs simultaneously and make correct predictions, improving model throughput while requiring minimal additional memory. Because of its unique capacity to represent complicated real-life issues, deep neural networks (DNNs) are a famous architecture in the machine learning field. Recent research suggests that such networks are grossly overparameterized, necessitating significant and possibly wasteful increases in model size and compute burden to construct modern DNNs. Continue Reading This Summary on This Research Here Paper: https://arxiv.org/pdf/2202.09318.pdf Github: https://github.com/princeton-nlp/datamux https://preview.redd.it/i0ynmrn9uel81.png?width=1288&format=png&auto=webp&s=13cda8ebdb6a0e25b573753417376271cd43ea5f submitted by /u/No_Coffee_4638 [link] [comments]
    Library?
    Besides FANN is there a C library for Linux? submitted by /u/mikeegg1 [link] [comments]
  • Open

    Good day, How does RL learn the differences between 2 patient subgroups and come up with an optimal policy for each: E.g Patients with chest congestion due to covid and one without covid?
    Good day, How does RL learn the differences between 2 patient subgroups and come up with an optimal policy for each: E.g Patients with chest congestion due to covid and one without covid? submitted by /u/Thin-Ad9581 [link] [comments]  ( 1 min )
    Hard to solve Deepmind Control Suite pixel-based finger turn_easy task
    Hi, I used stable-baselines3(TD3, SAC,DDPG, PPO) even DrQ(author's repository) to solve DMC finger turn_easy task. But I disappointed almost all trained fingers are almost motionless. they usually got 1000 or 0 points in one episode even though after 100k frames and running mean reward was 200~300. But DrQ paper shows they can get 1000 rewards at almost 100k frames. In my point of view, It happened by the fact that the episode is not finished even if the agent does not receive good reward until 1000 steps. Can you tell me some implementation that can solve finger turn_easy? Very Thanks. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 1 min )
    Application of Deep Reinforcement Learning for Operations Research problems
    Hello everyone! I am new in this community and extremely glad to find it :) I have been looking into solution methods for problems I am working in the area of Operations Research, in particular, on-demand delivery systems(eg. uber eats), I want to make use of the knowledge of previous deliveries to increase the efficiency of the system, but the methods that are used to OR problems generally i.e Evolutionary Algorithms don't seem to do that, of course, one can incorporate some methods inside the algorithm to make use of previous data, but I find reinforcement learning as a better approach for these kinds of problems. I would like to know if anyone of you has used RL to solve similar problems? Also if you could lead me to some resources. I would love to have a conversation regarding this as well! :) Thanks. submitted by /u/ImportantSurround [link] [comments]  ( 2 min )
  • Open

    [D] Has anyone used Spiking Neural Networks (SNNs) for image processing?
    As the title says... has anyone used Spiking Neural Networks (SNNs) for image processing? More specifically Medical image processing? There seems to be very little on SNNs being used in image processing. I would appreciate input from anyone who has experience with this or if anyone knows of any good Github repositories for this. submitted by /u/bangwhosnext [link] [comments]  ( 1 min )
    [D] Does the amount of learning a model does affect train time, or is it only the number of examples?
    Say a model is trained on a dataset that is completely randomly generated, and a separate instance of that model is trained on a dataset of the same size but where the features actually have predictive power for the response variable. Given that the same exact model and dataset are used, should the training time be equal in both cases? Or will one be lower than the other, since in one dataset there is actually something for the model to learn whereas the other one is random numbers? Does it depend on which model you're using, or maybe even what parameters you've set with your chosen model? submitted by /u/woodenlathe [link] [comments]  ( 1 min )
    Subjective beauty / aesthetically pleasing labeled image dataset [P]
    I’m looking for a dataset of images, could be of anything, that are labeled as aesthetically pleasing or not according to some human reviewers. Maybe on a scale from 1-10. Maybe it’s art, photographs, real estate photos, etc. I have a couple different ideas about how I could try and infer this by scraping different sources, but thought id check here to see if anyone knows of such a dataset submitted by /u/foreignparent [link] [comments]  ( 1 min )
    [P] Multi-video classification
    Hi, I want to build a classifier which uses multiple videos. I know video classifiers exist, but they rely on a single video. The obvious solution is to create an ensemble classifier which uses the outputs of classifiers trained on a single video to provide multi-video classification. For example, training a MLP or XGBoost on the outputs of the pre-softmax layer. While this is ok, I was wondering whether there are any more elegant/specialized end-to-end solutions for this? submitted by /u/rsandler [link] [comments]  ( 1 min )
    Current proposals for defense against deepfakes? [Discussion]
    I stumbled upon this article and realized how dangerous deepfakes could be: https://www.12newsnow.com/article/news/nation-world/ukraine/ukraine-warns-russia-may-deploy-deepfake-videos/507-4f18ea66-d5c7-4dac-9d41-2aea6495efdf The thought originally occurred to me after taking MIT's deep learning class with the Obama deep fake. Are there any prevention measures to counter them? I know even detecting a photoshopped photo is hard, although there is progress. So I'd imagine it would be incredibly hard to use computer vision to detect deep fakes. EXIF data isn't reliable either and it's easily deleted. Maybe in the future, each image would have to come with some sort of validation? Thoughts? submitted by /u/Calm_Ad_343 [link] [comments]  ( 1 min )
    [D] CycleGAN generators generating their own domain?
    I've been trying to train a cycleGAN that, succinctly, converts outlines into realistic looking objects given that outline, and vice-versa. During training, I monitor what each generator is outputting. To define a few terms: G_A should generate an object (domain B) given an outline (domain A) G_B should generate an outline (domain A) given an object (domain B) I've found that the networks always seem to "converge" to a state where G_A generates images belonging to domain A (instead of B) and the same for G_B. Basically, the generators are learning an identity mapping, transforming A -> A, and B -> B. This is exceedingly bizarre as the discriminator should be penalising this behaviour heavily, but the learning of this identity does get stronger over time, the generators are actually training and converging, even if it's to the wrong solution. The loss is your standard LSGAN loss, with cycle terms, and identity terms (this last one involves passing A to G_B and comparing to A, and vice-versa). I've trained GANs successfully before and I've never encountered this before, any advice? submitted by /u/Pedimus [link] [comments]  ( 1 min )
    [D] 7 pages for a long paper in ACL or other related conferences
    For most ML related conference, it asks you to submit a paper with up to 8 pages long. Is it ok if I submit a paper that is 7 to 7.5 pages long? Will that raise any suspicion among the reviewers because the paper is too short? submitted by /u/SiegeMemeLord [link] [comments]  ( 1 min )
    Hey all, I'm Sebastian Raschka, author of Machine Learning with Pytorch and Scikit-Learn. Please feel free to ask me anything!
    Hello everyone. I am excited about the invitation to do an AMA here. It's my first AMA on reddit, and I will be trying my best! I recently wrote the "Machine Learning with Pytorch and Scikit-Learn" book and joined a startup(Grid.ai) in January. I am also an Assistant Professor of Statistics at the University of Wisconsin-Madison since 2018. Btw. I am also a very passionate Python programmer and love open source. Please feel free to ask me anything about my book, working in industry (although my experience is still limited, haha), academia, or my research projects. But also don't hesitate to go on tangents and ask about other things -- this is an ask me anything after all (... topics like cross-country skiing come to mind). EDIT: Thanks everyone for making my first AMA here a really fun experience! Unfortunately, I have to call it a day, but I had a good time! Thanks for all the good questions, and sorry that I couldn't get to all of them! submitted by /u/seraschka [link] [comments]  ( 9 min )
    [N] 10 open PhD positions at TU Berlin FG Machine Learning
    (copy & paste from website) The positions are part of the Graduate School of the 'Berlin Institute for the Foundations of Learning and Data' (BIFOLD). BIFOLD conducts scalable agile fundamental research in the field of AI, primarily in the areas of data management (DM) and machine learning (ML). The research groups also address the new challenges and demands evoked by the rapidly growing importance of big data management and machine learning in virtually all domains, from medicine, industry, natural sciences, humanities, e-commerce and media, to government and societal affairs. 5 positions each are to be filled within the focus areas of DM (headed by Prof. Volker Markl) and ML (headed by Prof. Klaus-Robert Müller). Working field: Based on the overarching research foci of BIFOLD, the Gr…  ( 1 min )
    [D] How to contribute to open source ML and DL without having access to high quality setup?
    Hi. I love working on ML/DL problems, but one of the main drawbacks is this constant need to access to quality setup and high performance computing clusters for many meaningful tasks. As a person who is just starting to create a portfolio for myself and only have access to the likes of Colab (ultimately colab Pro), and cannot afford to create my own setup, what are my options? Many jobs require a decent level of familiarity with tools that you can't usually set up without working in a self-made environment. I also live in a third world country as a student and renting online clusters and etc is really expensive for me. So the question becomes what are some potential routes/paths for someone like me who wants to learn new skills and contribute, but really cannot find what to focus with the current resources? Thanks! submitted by /u/quit_daedalus [link] [comments]  ( 2 min )
    [D] Uncertainty estimation with calibration set (with MC Dropout)
    I'm building a regressor where I'd like to have some uncertainty estimates for my predictions. If a point is far from the training data manifold I expect very high uncertainty so I know that I can't rely on the prediction, i.e. I'm interested in epistemic uncertainty. The problem is that I can't directly define a reliable distance between my data points so I can not simply use something like GP or KNN. I encode my inputs using a neural network and add a regression head on top to provide the predictions. I tried to use MC Dropout to sample multiple predictions during evaluation and use them to estimate the probability distribution. The problem is that there is no observable difference between the variance estimated for the training samples and validation samples. Is it even a valid approach? The more I think about it the more I doubt how is it even possible that this method could yield well calibrated results. It seems unreasonable to expect good calibration without some sort of calibration set, as NN can overfit anything. Am I missing something? submitted by /u/alterframe [link] [comments]  ( 4 min )
    [R] Evolving Curricula with Regret-Based Environment Design
    https://accelagent.github.io/ submitted by /u/_rockt [link] [comments]
    [D] Manipulating class distributions based on holdout test set data...seems wrong, but maybe it's right?
    I have a set of data that spans from 2019 to current day (with some gaps in 2021). So far I've been using 2019-2021 data to train/cross validate and 2022 data as my hold out test set. The results on the test set are significantly worse than on the training/CV set. At first I suspected overfitting (as you would), but after playing around with how the datasets are constructed based on time of data capture, I concluded that overfitting is NOT the problem. The actual problem seems to be that in the older data, my positive class is more common than in the most recent data. For example, say in the 2020 data, I could have a 3:1 negative:positive label distribution, whereas in the 2022 data, I could have a 5:1 label distribution. If my training set only sees the 3:1 data, it's no wonder that it…  ( 5 min )
    A software to analyze and classificate Higgs-boson decays using machine learning with Python [R]
    Hi guys, I am PhD student in particle physics and I want to share with you something I worked some months ago with another collegue from Germany. We developed a software to analyze the Higgs-boson open-data from Kaggle (see Higgs Boson ML challenge) and we performed a machine learning analysis in order to separate the signal process (H->tautaut) with respect to the other background processes. We basically used deep neural networks (DNNs) and boosted decision trees (BDTs). We finally compared the obtained outputs of the two models through a logistic regression and validate the result with a metric called arithmetic-mean-significance (AMS). We then used the PyROOT framework to create cool results plots. We basically wrote the code using Jupyter Notebooks only, but later I decided to provide a python script version too in the repository. We wrote also a small paper to explain our work and we presented it at the International School on High-Energy physics. Below you can find the Github repository, if you like it don't forget to leave a star! Girhub repository: JustWhit3/higgs-decay-classification submitted by /u/TheCompiler95 [link] [comments]  ( 1 min )
    [D] The Perils of Ensembling Multiple Models
    Three other ML developers have trained and delivered models. The inputs are essentially images and there are three output classes. I've been asked to ensemble the three models. I'm to add a new head and leave the 3 models frozen. This means the new head has only 9 inputs and 3 outputs. Basically, all I feel that I can build are low capacity models with 3 or 4 layers. There's no departure of validation accuracy compared to training accuracy indicating overfitting. No matter what I do, this model trains and tests with holdout data to over 95%. I find that extremely suspicious with so few parameters. I also know that two classes are about 60/40 and the third class is rare. What should I do to convince myself that this model is actually OK, or conversely that it's crap? submitted by /u/Simusid [link] [comments]  ( 2 min )
    [D] Difference between Machine Learning in Physics Research and big tech companies
    Is there any significant difference between the machine learning used in physics research (molecular design, turbulence modeling, etc.) and big tech companies (Google, Facebook, etc.)? submitted by /u/Exotic-Dust2008 [link] [comments]
  • Open

    Unlocking new doors to artificial intelligence
    MEng graduate students engage with IBM to develop their research skills and solutions to real-world problems.  ( 6 min )
  • Open

    Build a cold start time series forecasting engine using AutoGluon
    Whether you’re allocating resources more efficiently for web traffic, forecasting patient demand for staffing needs, or anticipating sales of a company’s products, forecasting is an essential tool across many businesses. One particular use case, known as cold start forecasting, builds forecasts for a time series that has little or no existing historical data, such as […]  ( 6 min )
  • Open

    Why Is Adding Features Like Virtual Tour in a Real Estate App Be Fruitful?
    Real estate businesses have existed for many years and will almost definitely continue to succeed.  ( 4 min )
  • Open

    Explanation of Machine Learning Models Using Shapley Additive Explanation and Application for Real Data in Hospital. (arXiv:2112.11071v2 [cs.LG] UPDATED)
    When using machine learning techniques in decision-making processes, the interpretability of the models is important. In the present paper, we adopted the Shapley additive explanation (SHAP), which is based on fair profit allocation among many stakeholders depending on their contribution, for interpreting a gradient-boosting decision tree model using hospital data. For better interpretability, we propose two novel techniques as follows: (1) a new metric of feature importance using SHAP and (2) a technique termed feature packing, which packs multiple similar features into one grouped feature to allow an easier understanding of the model without reconstruction of the model. We then compared the explanation results between the SHAP framework and existing methods. In addition, we showed how the A/G ratio works as an important prognostic factor for cerebral infarction using our hospital data and proposed techniques.  ( 2 min )
    The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems. (arXiv:2006.07856v4 [cs.LG] UPDATED)
    This paper presents and characterizes an Open Application Repository for Federated Learning (OARF), a benchmark suite for federated machine learning systems. Previously available benchmarks for federated learning have focused mainly on synthetic datasets and use a limited number of applications. OARF mimics more realistic application scenarios with publicly available data sets as different data silos in image, text and structured data. Our characterization shows that the benchmark suite is diverse in data size, distribution, feature distribution and learning task complexity. The extensive evaluations with reference implementations show the future research opportunities for important aspects of federated learning systems. We have developed reference implementations, and evaluated the important aspects of federated learning, including model accuracy, communication cost, throughput and convergence time. Through these evaluations, we discovered some interesting findings such as federated learning can effectively increase end-to-end throughput.  ( 2 min )
    On Label Shift in Domain Adaptation via Wasserstein Distance. (arXiv:2110.15520v2 [cs.LG] UPDATED)
    We study the label shift problem between the source and target domains in general domain adaptation (DA) settings. We consider transformations transporting the target to source domains, which enable us to align the source and target examples. Through those transformations, we define the label shift between two domains via optimal transport and develop theory to investigate the properties of DA under various DA settings (e.g., closed-set, partial-set, open-set, and universal settings). Inspired from the developed theory, we propose Label and Data Shift Reduction via Optimal Transport (LDROT) which can mitigate the data and label shifts simultaneously. Finally, we conduct comprehensive experiments to verify our theoretical findings and compare LDROT with state-of-the-art baselines.  ( 2 min )
    Computationally Efficient and Statistically Optimal Robust Low-rank Matrix Estimation. (arXiv:2203.00953v1 [math.ST])
    Low-rank matrix estimation under heavy-tailed noise is challenging, both computationally and statistically. Convex approaches have been proven statistically optimal but suffer from high computational costs, especially since robust loss functions are usually non-smooth. More recently, computationally fast non-convex approaches via sub-gradient descent are proposed, which, unfortunately, fail to deliver a statistically consistent estimator even under sub-Gaussian noise. In this paper, we introduce a novel Riemannian sub-gradient (RsGrad) algorithm which is not only computationally efficient with linear convergence but also is statistically optimal, be the noise Gaussian or heavy-tailed. Convergence theory is established for a general framework and specific applications to absolute loss, Huber loss and quantile loss are investigated. Compared with existing non-convex methods, ours reveals a surprising phenomenon of dual-phase convergence. In phase one, RsGrad behaves as in a typical non-smooth optimization that requires gradually decaying stepsizes. However, phase one only delivers a statistically sub-optimal estimator which is already observed in existing literature. Interestingly, during phase two, RsGrad converges linearly as if minimizing a smooth and strongly convex objective function and thus a constant stepsize suffices. Underlying the phase-two convergence is the smoothing effect of random noise to the non-smooth robust losses in an area close but not too close to the truth. Numerical simulations confirm our theoretical discovery and showcase the superiority of RsGrad over prior methods.  ( 2 min )
    Adversarially Robust Learning with Tolerance. (arXiv:2203.00849v1 [stat.ML])
    We study the problem of tolerant adversarial PAC learning with respect to metric perturbation sets. In adversarial PAC learning, an adversary is allowed to replace a test point $x$ with an arbitrary point in a closed ball of radius $r$ centered at $x$. In the tolerant version, the error of the learner is compared with the best achievable error with respect to a slightly larger perturbation radius $(1+\gamma)r$. For perturbation sets with doubling dimension $d$, we show that a variant of the natural ``perturb-and-smooth'' algorithm PAC learns any hypothesis class $\mathcal{H}$ with VC dimension $v$ in the $\gamma$-tolerant adversarial setting with $O\left(\frac{v(1+1/\gamma)^{O(d)}}{\varepsilon}\right)$ samples. This is the first such general guarantee with linear dependence on $v$ even for the special case where the domain is the real line and the perturbation sets are closed balls (intervals) of radius $r$. However, the proposed guarantees for the perturb-and-smooth algorithm currently only hold in the tolerant robust realizable setting and exhibit exponential dependence on $d$. We additionally propose an alternative learning method which yields sample complexity bounds with only linear dependence on the doubling dimension even in the more general agnostic case. This approach is based on sample compression.  ( 2 min )
    Modeling Extremes with d-max-decreasing Neural Networks. (arXiv:2102.09042v2 [stat.ML] UPDATED)
    We propose a novel neural network architecture that enables non-parametric calibration and generation of multivariate extreme value distributions (MEVs). MEVs arise from Extreme Value Theory (EVT) as the necessary class of models when extrapolating a distributional fit over large spatial and temporal scales based on data observed in intermediate scales. In turn, EVT dictates that $d$-max-decreasing, a stronger form of convexity, is an essential shape constraint in the characterization of MEVs. As far as we know, our proposed architecture provides the first class of non-parametric estimators for MEVs that preserve these essential shape constraints. We show that our architecture approximates the dependence structure encoded by MEVs at parametric rate. Moreover, we present a new method for sampling high-dimensional MEVs using a generative model. We demonstrate our methodology on a wide range of experimental settings, ranging from environmental sciences to financial mathematics and verify that the structural properties of MEVs are retained compared to existing methods.  ( 2 min )
    Adaptive Granularity in Tensors: A Quest for Interpretable Structure. (arXiv:1912.09009v2 [cs.LG] UPDATED)
    Data collected at very frequent intervals is usually extremely sparse and has no structure that is exploitable by modern tensor decomposition algorithms. Thus the utility of such tensors is low, in terms of the amount of interpretable and exploitable structure that one can extract from them. In this paper, we introduce the problem of finding a tensor of adaptive aggregated granularity that can be decomposed to reveal meaningful latent concepts (structures) from datasets that, in their original form, are not amenable to tensor analysis. Such datasets fall under the broad category of sparse point processes that evolve over space and/or time. To the best of our knowledge, this is the first work that explores adaptive granularity aggregation in tensors. Furthermore, we formally define the problem and discuss what different definitions of "good structure" can be in practice, and show that optimal solution is of prohibitive combinatorial complexity. Subsequently, we propose an efficient and effective greedy algorithm called IceBreaker, which follows a number of intuitive decision criteria that locally maximize the "goodness of structure", resulting in high-quality tensors. We evaluate our method on synthetic, semi-synthetic and real datasets. In all the cases, our proposed method constructs tensors that have very high structure quality.  ( 2 min )
    An Analysis of Ensemble Sampling. (arXiv:2203.01303v1 [cs.LG])
    Ensemble sampling serves as a practical approximation to Thompson sampling when maintaining an exact posterior distribution over model parameters is computationally intractable. In this paper, we establish a Bayesian regret bound that ensures desirable behavior when ensemble sampling is applied to the linear bandit problem. This represents the first rigorous regret analysis of ensemble sampling and is made possible by leveraging information-theoretic concepts and novel analytic techniques that may prove useful beyond the scope of this paper.  ( 2 min )
    Principled Deep Neural Network Training through Linear Programming. (arXiv:1810.03218v3 [cs.LG] UPDATED)
    Deep learning has received much attention lately due to the impressive empirical performance achieved by training algorithms. Consequently, a need for a better theoretical understanding of these problems has become more evident in recent years. In this work, using a unified framework, we show that there exists a polyhedron which encodes simultaneously all possible deep neural network training problems that can arise from a given architecture, activation functions, loss function, and sample-size. Notably, the size of the polyhedral representation depends only linearly on the sample-size, and a better dependency on several other network parameters is unlikely (assuming $P\neq NP$). Additionally, we use our polyhedral representation to obtain new and better computational complexity results for training problems of well-known neural network architectures. Our results provide a new perspective on training problems through the lens of polyhedral theory and reveal a strong structure arising from these problems.  ( 2 min )
    Online Competitive Influence Maximization. (arXiv:2006.13411v4 [cs.LG] UPDATED)
    Online influence maximization has attracted much attention as a way to maximize influence spread through a social network while learning the values of unknown network parameters. Most previous works focus on single-item diffusion. In this paper, we introduce a new Online Competitive Influence Maximization (OCIM) problem, where two competing items (e.g., products, news stories) propagate in the same network and influence probabilities on edges are unknown. We adopt a combinatorial multi-armed bandit (CMAB) framework for OCIM, but unlike the non-competitive setting, the important monotonicity property (influence spread increases when influence probabilities on edges increase) no longer holds due to the competitive nature of propagation, which brings a significant new challenge to the problem. We provide a nontrivial proof showing that the Triggering Probability Modulated (TPM) condition for CMAB still holds in OCIM, which is instrumental for our proposed algorithms OCIM-TS and OCIM-OFU to achieve sublinear Bayesian and frequentist regret, respectively. We also design an OCIM-ETC algorithm that requires less feedback and easier offline computation, at the expense of a worse frequentist regret bound. Experimental evaluations demonstrate the effectiveness of our algorithms.  ( 2 min )
    LIMEADE: From AI Explanations to Advice Taking. (arXiv:2003.04315v3 [cs.IR] UPDATED)
    Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow an AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA$^2$Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on seventy real-world models across two broad domains: image classification and text recommendation. We show our method improves accuracy compared to a rigorous baseline on the image classification domains. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.  ( 2 min )
    Bayesian adaptive and interpretable functional regression for exposure profiles. (arXiv:2203.00784v1 [stat.ME])
    Pollutant exposures during gestation are a known and adverse factor for birth and health outcomes. However, the links between prenatal air pollution exposures and educational outcomes are less clear, in particular the critical windows of susceptibility during pregnancy. Using a large cohort of students in North Carolina, we study prenatal $\mbox{PM}_{2.5}$ exposures recorded at near-continuous resolutions and linked to 4th end-of-grade reading scores. We develop a locally-adaptive Bayesian regression model for scalar responses with functional and scalar predictors. The proposed model pairs a B-spline basis expansion with dynamic shrinkage priors to capture both smooth and rapidly-changing features in the regression surface. The local adaptivity is manifested in more accurate point estimates and more precise uncertainty quantification than existing methods on simulated data. The model is accompanied by a highly scalable Gibbs sampler for fully Bayesian inference on large datasets. In addition, we describe broad limitations with the interpretability of scalar-on-function regression models, and introduce new decision analysis tools to guide the model interpretation. Using these methods, we identify a period within the third trimester as the critical window of susceptibility to $\mbox{PM}_{2.5}$ exposure.  ( 2 min )
    Enhanced Nearest Neighbor Classification for Crowdsourcing. (arXiv:2203.00781v1 [cs.HC])
    In machine learning, crowdsourcing is an economical way to label a large amount of data. However, the noise in the produced labels may deteriorate the accuracy of any classification method applied to the labelled data. We propose an enhanced nearest neighbor classifier (ENN) to overcome this issue. Two algorithms are developed to estimate the worker quality (which is often unknown in practice): one is to construct the estimate based on the denoised worker labels by applying the $k$NN classifier to the expert data; the other is an iterative algorithm that works even without access to the expert data. Other than strong numerical evidence, our proposed methods are proven to achieve the same regret as its oracle version based on high-quality expert data. As a technical by-product, a lower bound on the sample size assigned to each worker to reach the optimal convergence rate of regret is derived.  ( 2 min )
    Efficient Online Linear Control with Stochastic Convex Costs and Unknown Dynamics. (arXiv:2203.01170v1 [math.OC])
    We consider the problem of controlling an unknown linear dynamical system under a stochastic convex cost and full feedback of both the state and cost function. We present a computationally efficient algorithm that attains an optimal $\sqrt{T}$ regret-rate against the best stabilizing linear controller. In contrast to previous work, our algorithm is based on the Optimism in the Face of Uncertainty paradigm. This results in a substantially improved computational complexity and a simpler analysis.  ( 2 min )
    The Optimal Noise in Noise-Contrastive Learning Is Not What You Think. (arXiv:2203.01110v1 [stat.ML])
    Learning a parametric model of a data distribution is a well-known statistical problem that has seen renewed interest as it is brought to scale in deep learning. Framing the problem as a self-supervised task, where data samples are discriminated from noise samples, is at the core of state-of-the-art methods, beginning with Noise-Contrastive Estimation (NCE). Yet, such contrastive learning requires a good noise distribution, which is hard to specify; domain-specific heuristics are therefore widely used. While a comprehensive theory is missing, it is widely assumed that the optimal noise should in practice be made equal to the data, both in distribution and proportion. This setting underlies Generative Adversarial Networks (GANs) in particular. Here, we empirically and theoretically challenge this assumption on the optimal noise. We show that deviating from this assumption can actually lead to better statistical estimators, in terms of asymptotic variance. In particular, the optimal noise distribution is different from the data's and even from a different family.  ( 2 min )
    Beyond GAP screening for Lasso by exploiting new dual cutting half-spaces with supplementary material. (arXiv:2203.00987v1 [cs.LG])
    In this paper, we propose a novel safe screening test for Lasso. Our procedure is based on a safe region with a dome geometry and exploits a canonical representation of the set of half-spaces (referred to as "dual cutting half-spaces" in this paper) containing the dual feasible set. The proposed safe region is shown to be always included in the state-of-the-art "GAP Sphere" and "GAP Dome" proposed by Fercoq et al. (and strictly so under very mild conditions) while involving the same computational burden. Numerical experiments confirm that our new dome enables to devise more powerful screening tests than GAP regions and lead to significant acceleration to solve Lasso.  ( 2 min )
    PUMA: Performance Unchanged Model Augmentation for Training Data Removal. (arXiv:2203.00846v1 [stat.ML])
    Preserving the performance of a trained model while removing unique characteristics of marked training data points is challenging. Recent research usually suggests retraining a model from scratch with remaining training data or refining the model by reverting the model optimization on the marked data points. Unfortunately, aside from their computational inefficiency, those approaches inevitably hurt the resulting model's generalization ability since they remove not only unique characteristics but also discard shared (and possibly contributive) information. To address the performance degradation problem, this paper presents a novel approach called Performance Unchanged Model Augmentation~(PUMA). The proposed PUMA framework explicitly models the influence of each training data point on the model's generalization ability with respect to various performance criteria. It then complements the negative impact of removing marked data by reweighting the remaining data optimally. To demonstrate the effectiveness of the PUMA framework, we compared it with multiple state-of-the-art data removal techniques in the experiments, where we show the PUMA can effectively and efficiently remove the unique characteristics of marked training data without retraining the model that can 1) fool a membership attack, and 2) resist performance degradation. In addition, as PUMA estimates the data importance during its operation, we show it could serve to debug mislabelled data points more efficiently than existing approaches.
    Imitation of Manipulation Skills Using Multiple Geometries. (arXiv:2203.01171v1 [cs.RO])
    Daily manipulation tasks are characterized by regular characteristics associated with the task structure, which can be described by multiple geometric primitives related to actions and object shapes. Such geometric descriptors can not be expressed only in Cartesian coordinate systems. In this paper, we propose a learning approach to extract the optimal representation from a dictionary of coordinate systems to represent an observed movement. This is achieved by using an extension of Gaussian distributions on Riemannian manifolds, which is used to analyse a set of user demonstrations statistically, by considering multiple geometries as candidate representations of the task. We formulate the reproduction problem as a general optimal control problem based on an iterative linear quadratic regulator (iLQR), where the Gaussian distribution in the extracted coordinate systems are used to define the cost function. We apply our approach to grasping and box opening tasks in simulation and on a 7-axis Franka Emika robot. The results show that the robot can exploit several geometries to execute the manipulation task and generalize it to new situations, by maintaining the invariant features of the skill in the coordinate system(s) of interest.
    Learning Conditional Variational Autoencoders with Missing Covariates. (arXiv:2203.01218v1 [stat.ML])
    Conditional variational autoencoders (CVAEs) are versatile deep generative models that extend the standard VAE framework by conditioning the generative model with auxiliary covariates. The original CVAE model assumes that the data samples are independent, whereas more recent conditional VAE models, such as the Gaussian process (GP) prior VAEs, can account for complex correlation structures across all data samples. While several methods have been proposed to learn standard VAEs from partially observed datasets, these methods fall short for conditional VAEs. In this work, we propose a method to learn conditional VAEs from datasets in which auxiliary covariates can contain missing values as well. The proposed method augments the conditional VAEs with a prior distribution for the missing covariates and estimates their posterior using amortised variational inference. At training time, our method marginalises the uncertainty associated with the missing covariates while simultaneously maximising the evidence lower bound. We develop computationally efficient methods to learn CVAEs and GP prior VAEs that are compatible with mini-batching. Our experiments on simulated datasets as well as on a clinical trial study show that the proposed method outperforms previous methods in learning conditional VAEs from non-temporal, temporal, and longitudinal datasets.
    Optimality and complexity of classification by random projection. (arXiv:2108.06339v2 [cs.LG] UPDATED)
    The generalization error of a classifier is related to the complexity of the set of functions among which the classifier is chosen. We study a family of low-complexity classifiers consisting of thresholding a random one-dimensional feature. The feature is obtained by projecting the data on a random line after embedding it into a higher-dimensional space parametrized by monomials of order up to k. More specifically, the extended data is projected n-times and the best classifier among those n, based on its performance on training data, is chosen. We show that this type of classifier is extremely flexible, as it is likely to approximate, to an arbitrary precision, any continuous function on a compact set as well as any boolean function on a compact set that splits the support into measurable subsets. In particular, given full knowledge of the class conditional densities, the error of these low-complexity classifiers would converge to the optimal (Bayes) error as k and n go to infinity. On the other hand, if only a training dataset is given, we show that the classifiers will perfectly classify all the training points as k and n go to infinity. We also bound the generalization error of our random classifiers. In general, our bounds are better than those for any classifier with VC dimension greater than O (ln n) . In particular, our bounds imply that, unless the number of projections n is extremely large, there is a significant advantageous gap between the generalization error of the random projection approach and that of a linear classifier in the extended space. Asymptotically, as the number of samples approaches infinity, the gap persists for any such n. Thus, there is a potentially large gain in generalization properties by selecting parameters at random, rather than optimization.
    Chained Generalisation Bounds. (arXiv:2203.00977v1 [stat.ML])
    This work discusses how to derive upper bounds for the expected generalisation error of supervised learning algorithms by means of the chaining technique. By developing a general theoretical framework, we establish a duality between generalisation bounds based on the regularity of the loss function, and their chained counterparts, which can be obtained by lifting the regularity assumption from the loss onto its gradient. This allows us to re-derive the chaining mutual information bound from the literature, and to obtain novel chained information-theoretic generalisation bounds, based on the Wasserstein distance and other probability metrics. We show on some toy examples that the chained generalisation bound can be significantly tighter than its standard counterpart, particularly when the distribution of the hypotheses selected by the algorithm is very concentrated. Keywords: Generalisation bounds; Chaining; Information-theoretic bounds; Mutual information; Wasserstein distance; PAC-Bayes.
    A Unifying Framework for Some Directed Distances in Statistics. (arXiv:2203.00863v1 [math.ST])
    Density-based directed distances -- particularly known as divergences -- between probability distributions are widely used in statistics as well as in the adjacent research fields of information theory, artificial intelligence and machine learning. Prominent examples are the Kullback-Leibler information distance (relative entropy) which e.g. is closely connected to the omnipresent maximum likelihood estimation method, and Pearson's chisquare-distance which e.g. is used for the celebrated chisquare goodness-of-fit test. Another line of statistical inference is built upon distribution-function-based divergences such as e.g. the prominent (weighted versions of) Cramer-von Mises test statistics respectively Anderson-Darling test statistics which are frequently applied for goodness-of-fit investigations; some more recent methods deal with (other kinds of) cumulative paired divergences and closely related concepts. In this paper, we provide a general framework which covers in particular both the above-mentioned density-based and distribution-function-based divergence approaches; the dissimilarity of quantiles respectively of other statistical functionals will be included as well. From this framework, we structurally extract numerous classical and also state-of-the-art (including new) procedures. Furthermore, we deduce new concepts of dependence between random variables, as alternatives to the celebrated mutual information. Some variational representations are discussed, too.
    On the Impossibility of Convergence of Mixed Strategies with No Regret Learning. (arXiv:2012.02125v3 [cs.GT] UPDATED)
    We study the limiting behavior of the mixed strategies that result from optimal no-regret learning strategies in a repeated game setting where the stage game is any 2 by 2 competitive game. We consider optimal no-regret algorithms that are mean-based and monotonic in their argument. We show that for any such algorithm, the limiting mixed strategies of the players cannot converge almost surely to any Nash equilibrium. This negative result is also shown to hold under a broad relaxation of these assumptions, including popular variants of Online-Mirror-Descent with optimism and/or adaptive step-sizes. Finally, we conjecture that the monotonicity assumption can be removed, and provide partial evidence for this conjecture. Our results identify the inherent stochasticity in players' realizations as a critical factor underlying this divergence in outcomes between using the opponent's mixtures and realizations to make updates.
    Asymptotic Normality of Log Likelihood Ratio and Fundamental Limit of the Weak Detection for Spiked Wigner Matrices. (arXiv:2203.00821v1 [math.ST])
    We consider the problem of detecting the presence of a signal in a rank-one spiked Wigner model. Assuming that the signal is drawn from the Rademacher prior, we prove that the log likelihood ratio of the spiked model against the null model converges to a Gaussian when the signal-to-noise ratio is below a certain threshold. From the mean and the variance of the limiting Gaussian, we also compute the limit of the sum of the Type-I error and the Type-II error of the likelihood ratio test.
    Are Latent Factor Regression and Sparse Regression Adequate?. (arXiv:2203.01219v1 [stat.ME])
    We propose the Factor Augmented sparse linear Regression Model (FARM) that not only encompasses both the latent factor regression and sparse linear regression as special cases but also bridges dimension reduction and sparse regression together. We provide theoretical guarantees for the estimation of our model under the existence of sub-Gaussian and heavy-tailed noises (with bounded (1+x)-th moment, for all x>0), respectively. In addition, the existing works on supervised learning often assume the latent factor regression or the sparse linear regression is the true underlying model without justifying its adequacy. To fill in such an important gap, we also leverage our model as the alternative model to test the sufficiency of the latent factor regression and the sparse linear regression models. To accomplish these goals, we propose the Factor-Adjusted de-Biased Test (FabTest) and a two-stage ANOVA type test respectively. We also conduct large-scale numerical experiments including both synthetic and FRED macroeconomics data to corroborate the theoretical properties of our methods. Numerical results illustrate the robustness and effectiveness of our model against latent factor regression and sparse linear regression models.
    Partial Likelihood Thompson Sampling. (arXiv:2203.00820v1 [stat.ME])
    We consider the problem of deciding how best to target and prioritize existing vaccines that may offer protection against new variants of an infectious disease. Sequential experiments are a promising approach; however, challenges due to delayed feedback and the overall ebb and flow of disease prevalence make available method inapplicable for this task. We present a method, partial likelihood Thompson sampling, that can handle these challenges. Our method involves running Thompson sampling with belief updates determined by partial likelihood each time we observe an event. To test our approach, we ran a semi-synthetic experiment based on 200 days of COVID-19 infection data in the US.
    Model-agnostic out-of-distribution detection using combined statistical tests. (arXiv:2203.01097v1 [stat.ML])
    We present simple methods for out-of-distribution detection using a trained generative model. These techniques, based on classical statistical tests, are model-agnostic in the sense that they can be applied to any differentiable generative model. The idea is to combine a classical parametric test (Rao's score test) with the recently introduced typicality test. These two test statistics are both theoretically well-founded and exploit different sources of information based on the likelihood for the typicality test and its gradient for the score test. We show that combining them using Fisher's method overall leads to a more accurate out-of-distribution test. We also discuss the benefits of casting out-of-distribution detection as a statistical testing problem, noting in particular that false positive rate control can be valuable for practical out-of-distribution detection. Despite their simplicity and generality, these methods can be competitive with model-specific out-of-distribution detection algorithms without any assumptions on the out-distribution.
    Sensor-Based Estimation of Dim Light Melatonin Onset (DLMO) Using Features of Two Time Scales. (arXiv:1908.07483v5 [cs.LG] UPDATED)
    Circadian rhythms influence multiple essential biological activities including sleep, performance, and mood. The dim light melatonin onset (DLMO) is the gold standard for measuring human circadian phase (i.e., timing). The collection of DLMO is expensive and time-consuming since multiple saliva or blood samples are required overnight in special conditions, and the samples must then be assayed for melatonin. Recently, several computational approaches have been designed for estimating DLMO. These methods collect daily sampled data (e.g., sleep onset/offset times) or frequently sampled data (e.g., light exposure/skin temperature/physical activity collected every minute) to train learning models for estimating DLMO. One limitation of these studies is that they only leverage one time-scale data. We propose a two-step framework for estimating DLMO using data from both time scales. The first step summarizes data from before the current day, while the second step combines this summary with frequently sampled data of the current day. We evaluate three moving average models that input sleep timing data as the first step and use recurrent neural network models as the second step. The results using data from 207 undergraduates show that our two-step model with two time-scale features has statistically significantly lower root-mean-square errors than models that use either daily sampled data or frequently sampled data.
    A new perspective on low-rank optimization. (arXiv:2105.05947v2 [math.OC] UPDATED)
    A key question in many low-rank problems throughout optimization, machine learning, and statistics is to characterize the convex hulls of simple low-rank sets and judiciously apply these convex hulls to obtain strong yet computationally tractable convex relaxations. We invoke the matrix perspective function - the matrix analog of the perspective function - and characterize explicitly the convex hull of epigraphs of simple matrix convex functions under low-rank constraints. Further, we combine the matrix perspective function with orthogonal projection matrices-the matrix analog of binary variables which capture the row-space of a matrix-to develop a matrix perspective reformulation technique that reliably obtains strong relaxations for a variety of low-rank problems, including reduced rank regression, non-negative matrix factorization, and factor analysis. Moreover, we establish that these relaxations can be modeled via semidefinite constraints and thus optimized over tractably. The proposed approach parallels and generalizes the perspective reformulation technique in mixed-integer optimization and leads to new relaxations for a broad class of problems.
    On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features. (arXiv:2203.01238v1 [cs.LG])
    When training deep neural networks for classification tasks, an intriguing empirical phenomenon has been widely observed in the last-layer classifiers and features, where (i) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and (ii) cross-example within-class variability of last-layer activations collapses to zero. This phenomenon is called Neural Collapse (NC), which seems to take place regardless of the choice of loss functions. In this work, we justify NC under the mean squared error (MSE) loss, where recent empirical evidence shows that it performs comparably or even better than the de-facto cross-entropy loss. Under a simplified unconstrained feature model, we provide the first global landscape analysis for vanilla nonconvex MSE loss and show that the (only!) global minimizers are neural collapse solutions, while all other critical points are strict saddles whose Hessian exhibit negative curvature directions. Furthermore, we justify the usage of rescaled MSE loss by probing the optimization landscape around the NC solutions, showing that the landscape can be improved by tuning the rescaling hyperparameters. Finally, our theoretical findings are experimentally verified on practical network architectures.
    Canonical foliations of neural networks: application to robustness. (arXiv:2203.00922v1 [stat.ML])
    Adversarial attack is an emerging threat to the trustability of machine learning. Understanding these attacks is becoming a crucial task. We propose a new vision on neural network robustness using Riemannian geometry and foliation theory, and create a new adversarial attack by taking into account the curvature of the data space. This new adversarial attack called the "dog-leg attack" is a two-step approximation of a geodesic in the data space. The data space is treated as a (pseudo) Riemannian manifold equipped with the pullback of the Fisher Information Metric (FIM) of the neural network. In most cases, this metric is only semi-definite and its kernel becomes a central object to study. A canonical foliation is derived from this kernel. The curvature of the foliation's leaves gives the appropriate correction to get a two-step approximation of the geodesic and hence a new efficient adversarial attack. Our attack is tested on a toy example, a neural network trained to mimic the $\texttt{Xor}$ function, and demonstrates better results that the state of the art attack presented by Zhao et al. (2019).
    Institutional Grammar 2.0 Codebook. (arXiv:2008.08937v4 [cs.MA] UPDATED)
    The Grammar of Institutions, or Institutional Grammar, is an established approach to encode policy information in terms of institutional statements based on a set of pre-defined syntactic components. This codebook provides coding guidelines for a revised version of the Institutional Grammar, the Institutional Grammar 2.0 (IG 2.0). IG 2.0 is a specification that aims at facilitating the encoding of policy to meet varying analytical objectives. To this end, it revises the grammar with respect to comprehensiveness, flexibility, and specificity by offering multiple levels of expressiveness (IG Core, IG Extended, IG Logico). In addition to the encoding of regulative statements, it further introduces the encoding of constitutive institutional statements, as well as statements that exhibit both constitutive and regulative characteristics. Introducing those aspects, the codebook initially covers fundamental concepts of IG 2.0, before providing an overview of pre-coding steps relevant for document preparation. Detailed coding guidelines are provided for both regulative and constitutive statements across all levels of expressiveness, along with the encoding guidelines for statements of mixed form -- hybrid and polymorphic institutional statements. The document further provides an overview of taxonomies used in the encoding process and referred to throughout the codebook. The codebook concludes with a summary and discussion of relevant considerations to facilitate the coding process. An initial Reader's Guide helps the reader tailor the content to her interest. Note that this codebook specifically focuses on operational aspects of IG 2.0 in the context of policy coding. Links to additional resources such as the underlying scientific literature (that offers a comprehensive treatment of the underlying theoretical concepts) are referred to in the DOI and the concluding section of the codebook.
    An Accelerated Stochastic Algorithm for Solving the Optimal Transport Problem. (arXiv:2203.00813v1 [stat.ML])
    We propose a novel accelerated stochastic algorithm -- primal-dual accelerated stochastic gradient descent with variance reduction (PDASGD) -- for solving the optimal transport (OT) problem between two discrete distributions. PDASGD can also be utilized to compute for the Wasserstein barycenter (WB) of multiple discrete distributions. In both the OT and WB cases, the proposed algorithm enjoys the best-known convergence rate (in the form of order of computational complexity) in the literature. PDASGD is easy to implement in nature, due to its stochastic property: computation per iteration can be much faster than other non-stochastic counterparts. We carry out numerical experiments on both synthetic and real data; they demonstrate the improved efficiency of PDASGD.
    Discrete Optimal Transport with Independent Marginals is #P-Hard. (arXiv:2203.01161v1 [math.OC])
    We study the computational complexity of the optimal transport problem that evaluates the Wasserstein distance between the distributions of two K-dimensional discrete random vectors. The best known algorithms for this problem run in polynomial time in the maximum of the number of atoms of the two distributions. However, if the components of either random vector are independent, then this number can be exponential in K even though the size of the problem description scales linearly with K. We prove that the described optimal transport problem is #P-hard even if all components of the first random vector are independent uniform Bernoulli random variables, while the second random vector has merely two atoms, and even if only approximate solutions are sought. We also develop a dynamic programming-type algorithm that approximates the Wasserstein distance in pseudo-polynomial time when the components of the first random vector follow arbitrary independent discrete distributions, and we identify special problem instances that can be solved exactly in strongly polynomial time.
    Metalearning Linear Bandits by Prior Update. (arXiv:2107.05320v2 [stat.ML] UPDATED)
    Fully Bayesian approaches to sequential decision-making assume that problem parameters are generated from a known prior. In practice, such information is often lacking. This problem is exacerbated in setups with partial information, where a misspecified prior may lead to poor exploration and performance. In this work we prove, in the context of stochastic linear bandits and Gaussian priors, that as long as the prior is sufficiently close to the true prior, the performance of the applied algorithm is close to that of the algorithm that uses the true prior. Furthermore, we address the task of learning the prior through metalearning, where a learner updates her estimate of the prior across multiple task instances in order to improve performance on future tasks. We provide an algorithm and regret bounds, demonstrate its effectiveness in comparison to an algorithm that knows the correct prior, and support our theoretical results empirically. Our theoretical results hold for a broad class of algorithms, including Thompson Sampling and Information Directed Sampling.
    Estimating average causal effects from patient trajectories. (arXiv:2203.01228v1 [stat.ML])
    In medical practice, treatments are selected based on the expected causal effects on patient outcomes. Here, the gold standard for estimating causal effects are randomized controlled trials; however, such trials are costly and sometimes even unethical. Instead, medical practice is increasingly interested in estimating causal effects among patient subgroups from electronic health records, that is, observational data. In this paper, we aim at estimating the average causal effect (ACE) from observational data (patient trajectories) that are collected over time. For this, we propose DeepACE: an end-to-end deep learning model. DeepACE leverages the iterative G-computation formula to adjust for the bias induced by time-varying confounders. Moreover, we develop a novel sequential targeting procedure which ensures that DeepACE has favorable theoretical properties, i.e., is doubly robust and asymptotically efficient. To the best of our knowledge, this is the first work that proposes an end-to-end deep learning model for estimating time-varying ACEs. We compare DeepACE in an extensive number of experiments, confirming that it achieves state-of-the-art performance. We further provide a case study for patients suffering from low back pain to demonstrate that DeepACE generates important and meaningful findings for clinical practice. Our work enables medical practitioners to develop effective treatment recommendations tailored to patient subgroups.
    Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls. (arXiv:2104.09311v2 [math.OC] UPDATED)
    We study finite-time horizon continuous-time linear-convex reinforcement learning problems in an episodic setting. In this problem, the unknown linear jump-diffusion process is controlled subject to nonsmooth convex costs. We show that the associated linear-convex control problems admit Lipchitz continuous optimal feedback controls and further prove the Lipschitz stability of the feedback controls, i.e., the performance gap between applying feedback controls for an incorrect model and for the true model depends Lipschitz-continuously on the magnitude of perturbations in the model coefficients; the proof relies on a stability analysis of the associated forward-backward stochastic differential equation. We then propose a novel least-squares algorithm which achieves a regret of the order $O(\sqrt{N\ln N})$ on linear-convex learning problems with jumps, where $N$ is the number of learning episodes; the analysis leverages the Lipschitz stability of feedback controls and concentration properties of sub-Weibull random variables. Numerical experiment confirms the convergence and the robustness of the proposed algorithm.
    Mitigating Statistical Bias within Differentially Private Synthetic Data. (arXiv:2108.10934v2 [stat.ML] UPDATED)
    Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
    A density peaks clustering algorithm with sparse search and K-d tree. (arXiv:2203.00973v1 [stat.ML])
    Density peaks clustering has become a nova of clustering algorithm because of its simplicity and practicality. However, there is one main drawback: it is time-consuming due to its high computational complexity. Herein, a density peaks clustering algorithm with sparse search and K-d tree is developed to solve this problem. Firstly, a sparse distance matrix is calculated by using K-d tree to replace the original full rank distance matrix, so as to accelerate the calculation of local density. Secondly, a sparse search strategy is proposed to accelerate the computation of relative-separation with the intersection between the set of k nearest neighbors and the set consisting of the data points with larger local density for any data point. Furthermore, a second-order difference method for decision values is adopted to determine the cluster centers adaptively. Finally, experiments are carried out on datasets with different distribution characteristics, by comparing with other five typical clustering algorithms. It is proved that the algorithm can effectively reduce the computational complexity. Especially for larger datasets, the efficiency is elevated more remarkably. Moreover, the clustering accuracy is also improved to a certain extent. Therefore, it can be concluded that the overall performance of the newly proposed algorithm is excellent.
    Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits. (arXiv:1807.07623v6 [cs.LG] UPDATED)
    We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent (OMD) with Tsallis entropy regularization with power $\alpha=1/2$ and reduced-variance loss estimators. More generally, we define an adversarial regime with a self-bounding constraint, which includes stochastic regime, stochastically constrained adversarial regime (Wei and Luo), and stochastic regime with adversarial corruptions (Lykouris et al.) as special cases, and show that the algorithm achieves logarithmic regret guarantee in this regime and all of its special cases simultaneously with the adversarial regret guarantee.} The algorithm also achieves adversarial and stochastic optimality in the utility-based dueling bandit setting. We provide empirical evaluation of the algorithm demonstrating that it significantly outperforms UCB1 and EXP3 in stochastic environments. We also provide examples of adversarial environments, where UCB1 and Thompson Sampling exhibit almost linear regret, whereas our algorithm suffers only logarithmic regret. To the best of our knowledge, this is the first example demonstrating vulnerability of Thompson Sampling in adversarial environments. Last, but not least, we present a general stochastic analysis and a general adversarial analysis of OMD algorithms with Tsallis entropy regularization for $\alpha\in[0,1]$ and explain the reason why $\alpha=1/2$ works best.
    A Constrained Optimization Approach to Bilevel Optimization with Multiple Inner Minima. (arXiv:2203.01123v1 [math.OC])
    Bilevel optimization has found extensive applications in modern machine learning problems such as hyperparameter optimization, neural architecture search, meta-learning, etc. While bilevel problems with a unique inner minimal point (e.g., where the inner function is strongly convex) are well understood, bilevel problems with multiple inner minimal points remains to be a challenging and open problem. Existing algorithms designed for such a problem were applicable to restricted situations and do not come with the full guarantee of convergence. In this paper, we propose a new approach, which convert the bilevel problem to an equivalent constrained optimization, and then the primal-dual algorithm can be used to solve the problem. Such an approach enjoys a few advantages including (a) addresses the multiple inner minima challenge; (b) features fully first-order efficiency without involving second-order Hessian and Jacobian computations, as opposed to most existing gradient-based bilevel algorithms; (c) admits the convergence guarantee via constrained nonconvex optimization. Our experiments further demonstrate the desired performance of the proposed approach.
  • Open

    UAV-Aided Decentralized Learning over Mesh Networks. (arXiv:2203.01008v1 [cs.IT])
    Decentralized learning empowers wireless network devices to collaboratively train a machine learning (ML) model relying solely on device-to-device (D2D) communication. It is known that the convergence speed of decentralized optimization algorithms severely depends on the degree of the network connectivity, with denser network topologies leading to shorter convergence time. Consequently, local connectivity of real world mesh networks, due to the limited communication range of its wireless nodes, undermines the efficiency of decentralized learning protocols, rendering them potentially impracticable. In this work we investigate the role of an unmanned aerial vehicle (UAV), used as flying relay, in facilitating decentralized learning procedures in such challenging conditions. We propose an optimized UAV trajectory, that is defined as a sequence of waypoints that the UAV visits sequentially in order to transfer intelligence across sparsely connected group of users. We then provide a series of experiments highlighting the essential role of UAVs in the context of decentralized learning over mesh networks.  ( 2 min )
    Beyond GAP screening for Lasso by exploiting new dual cutting half-spaces with supplementary material. (arXiv:2203.00987v1 [cs.LG])
    In this paper, we propose a novel safe screening test for Lasso. Our procedure is based on a safe region with a dome geometry and exploits a canonical representation of the set of half-spaces (referred to as "dual cutting half-spaces" in this paper) containing the dual feasible set. The proposed safe region is shown to be always included in the state-of-the-art "GAP Sphere" and "GAP Dome" proposed by Fercoq et al. (and strictly so under very mild conditions) while involving the same computational burden. Numerical experiments confirm that our new dome enables to devise more powerful screening tests than GAP regions and lead to significant acceleration to solve Lasso.  ( 2 min )
    Information Gain Propagation: a new way to Graph Active Learning with Soft Labels. (arXiv:2203.01093v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved great success in various tasks, but their performance highly relies on a large number of labeled nodes, which typically requires considerable human effort. GNN-based Active Learning (AL) methods are proposed to improve the labeling efficiency by selecting the most valuable nodes to label. Existing methods assume an oracle can correctly categorize all the selected nodes and thus just focus on the node selection. However, such an exact labeling task is costly, especially when the categorization is out of the domain of individual expert (oracle). The paper goes further, presenting a soft-label approach to AL on GNNs. Our key innovations are: i) relaxed queries where a domain expert (oracle) only judges the correctness of the predicted labels (a binary question) rather than identifying the exact class (a multi-class question), and ii) new criteria of maximizing information gain propagation for active learner with relaxed queries and soft labels. Empirical studies on public datasets demonstrate that our method significantly outperforms the state-of-the-art GNN-based AL methods in terms of both accuracy and labeling cost.  ( 2 min )
    Hyperparameter optimization of data-driven AI models on HPC systems. (arXiv:2203.01112v1 [physics.data-an])
    In the European Center of Excellence in Exascale computing "Research on AI- and Simulation-Based Engineering at Exascale" (CoE RAISE), researchers develop novel, scalable AI technologies towards Exascale. This work exercises High Performance Computing resources to perform large-scale hyperparameter optimization using distributed training on multiple compute nodes. This is part of RAISE's work on data-driven use cases which leverages AI- and HPC cross-methods developed within the project. In response to the demand for parallelizable and resource efficient hyperparameter optimization methods, advanced hyperparameter search algorithms are benchmarked and compared. The evaluated algorithms, including Random Search, Hyperband and ASHA, are tested and compared in terms of both accuracy and accuracy per compute resources spent. As an example use case, a graph neural network model known as MLPF, developed for the task of Machine-Learned Particle-Flow reconstruction in High Energy Physics, acts as the base model for optimization. Results show that hyperparameter optimization significantly increased the performance of MLPF and that this would not have been possible without access to large-scale High Performance Computing resources. It is also shown that, in the case of MLPF, the ASHA algorithm in combination with Bayesian optimization gives the largest performance increase per compute resources spent out of the investigated algorithms.  ( 2 min )
    Improving the Diversity of Bootstrapped DQN via Noisy Priors. (arXiv:2203.01004v1 [cs.LG])
    Q-learning is one of the most well-known Reinforcement Learning algorithms. There have been tremendous efforts to develop this algorithm using neural networks. Bootstrapped Deep Q-Learning Network is amongst one of them. It utilizes multiple neural network heads to introduce diversity into Q-learning. Diversity can sometimes be viewed as the amount of reasonable moves an agent can take at a given state, analogous to the definition of the exploration ratio in RL. Thus, the performance of Bootstrapped Deep Q-Learning Network is deeply connected with the level of diversity within the algorithm. In the original research, it was pointed out that a random prior could improve the performance of the model. In this article, we further explore the possibility of treating priors as a special type of noise and sample priors from a Gaussian distribution to introduce more diversity into this algorithm. We conduct our experiment on the Atari benchmark and compare our algorithm to both the original and other related algorithms. The results show that our modification of the Bootstrapped Deep Q-Learning algorithm achieves significantly higher evaluation scores across different types of Atari games. Thus, we conclude that noisy priors can improve Bootstrapped Deep Q-Learning's performance by ensuring the integrity of diversities.  ( 2 min )
    Multi-Layer Perceptron Neural Network for Improving Detection Performance of Malicious Phishing URLs Without Affecting Other Attack Types Classification. (arXiv:2203.00774v1 [cs.CR])
    The hypothesis here states that neural network algorithms such as Multi-layer Perceptron (MLP) have higher accuracy in differentiating malicious and semi-structured phishing URLs. Compared to classical machine learning algorithms such as Logistic Regression and Multinomial Naive Bayes, the classical algorithms rely heavily on substantial corpus data training and machine learning experts' domain knowledge to perform complex feature engineering. MLP could perform non-linear separable multi-classes classification and focus less on corpus feature training. In addition, backpropagation weight adjustment could learn which features are more important in differentiating phishing from other attack types.  ( 2 min )
    Discontinuous Constituency and BERT: A Case Study of Dutch. (arXiv:2203.01063v1 [cs.CL])
    In this paper, we set out to quantify the syntactic capacity of BERT in the evaluation regime of non-context free patterns, as occurring in Dutch. We devise a test suite based on a mildly context-sensitive formalism, from which we derive grammars that capture the linguistic phenomena of control verb nesting and verb raising. The grammars, paired with a small lexicon, provide us with a large collection of naturalistic utterances, annotated with verb-subject pairings, that serve as the evaluation test bed for an attention-based span selection probe. Our results, backed by extensive analysis, suggest that the models investigated fail in the implicit acquisition of the dependencies examined.  ( 2 min )
    Weakly Supervised Correspondence Learning. (arXiv:2203.00904v1 [cs.RO])
    Correspondence learning is a fundamental problem in robotics, which aims to learn a mapping between state, action pairs of agents of different dynamics or embodiments. However, current correspondence learning methods either leverage strictly paired data -- which are often difficult to collect -- or learn in an unsupervised fashion from unpaired data using regularization techniques such as cycle-consistency -- which suffer from severe misalignment issues. We propose a weakly supervised correspondence learning approach that trades off between strong supervision over strictly paired data and unsupervised learning with a regularizer over unpaired data. Our idea is to leverage two types of weak supervision: i) temporal ordering of states and actions to reduce the compounding error, and ii) paired abstractions, instead of paired data, to alleviate the misalignment problem and learn a more accurate correspondence. The two types of weak supervision are easy to access in real-world applications, which simultaneously reduces the high cost of annotating strictly paired data and improves the quality of the learned correspondence.  ( 2 min )
    Practical Recommendations for the Design of Automatic Fault Detection Algorithms Based on Experiments with Field Monitoring Data. (arXiv:2203.01103v1 [eess.SY])
    Automatic fault detection (AFD) is a key technology to optimize the Operation and Maintenance of photovoltaic (PV) systems portfolios. A very common approach to detect faults in PV systems is based on the comparison between measured and simulated performance. Although this approach has been explored by many authors, due to the lack a common basis for evaluating their performance, it is still unclear what are the influencing aspects in the design of AFD algorithms. In this study, a series of AFD algorithms have been tested under real operating conditions, using monitoring data collected over 58 months on 80 rooftop-type PV systems installed in Germany. The results shown that this type of AFD algorithm have the potential to detect up to 82.8% of the energy losses with specificity above 90%. In general, the higher the simulation accuracy, the higher the specificity. The use of less accurate simulations can increase sensitivity at the cost of decreasing specificity. Analyzing the measurements individually makes the algorithm less sensitive to the simulation accuracy. The use of machine learning clustering algorithm for the statistical analysis showed exceptional ability to prevent false alerts, even in cases where the modeling accuracy is not high. If a slightly higher level of false alerts can be tolerated, the analysis of daily PR using a Shewhart chart provides the high sensitivity with an exceptionally simple solution with no need for more complex algorithms for modeling or clustering.  ( 2 min )
    Runtime Detection of Executional Errors in Robot-Assisted Surgery. (arXiv:2203.00737v1 [cs.CV])
    Despite significant developments in the design of surgical robots and automated techniques for objective evaluation of surgical skills, there are still challenges in ensuring safety in robot-assisted minimally-invasive surgery (RMIS). This paper presents a runtime monitoring system for the detection of executional errors during surgical tasks through the analysis of kinematic data. The proposed system incorporates dual Siamese neural networks and knowledge of surgical context, including surgical tasks and gestures, their distributional similarities, and common error modes, to learn the differences between normal and erroneous surgical trajectories from small training datasets. We evaluate the performance of the error detection using Siamese networks compared to single CNN and LSTM networks trained with different levels of contextual knowledge and training data, using the dry-lab demonstrations of the Suturing and Needle Passing tasks from the JIGSAWS dataset. Our results show that gesture specific task nonspecific Siamese networks obtain micro F1 scores of 0.94 (Siamese-CNN) and 0.95 (Siamese-LSTM), and perform better than single CNN (0.86) and LSTM (0.87) networks. These Siamese networks also outperform gesture nonspecific task specific Siamese-CNN and Siamese-LSTM models for Suturing and Needle Passing.  ( 2 min )
    On-Device Learning: A Neural Network Based Field-Trainable Edge AI. (arXiv:2203.01077v1 [cs.LG])
    In real-world edge AI applications, their accuracy is often affected by various environmental factors, such as noises, location/calibration of sensors, and time-related changes. This article introduces a neural network based on-device learning approach to address this issue without going deep. Our approach is quite different from de facto backpropagation based training but tailored for low-end edge devices. This article introduces its algorithm and implementation on a wireless sensor node consisting of Raspberry Pi Pico and low-power wireless module. Experiments using vibration patterns of rotating machines demonstrate that retraining by the on-device learning significantly improves an anomaly detection accuracy at a noisy environment while saving computation and communication costs for low power.  ( 2 min )
    Discriminating Against Unrealistic Interpolations in Generative Adversarial Networks. (arXiv:2203.01035v1 [cs.LG])
    Interpolations in the latent space of deep generative models is one of the standard tools to synthesize semantically meaningful mixtures of generated samples. As the generator function is non-linear, commonly used linear interpolations in the latent space do not yield the shortest paths in the sample space, resulting in non-smooth interpolations. Recent work has therefore equipped the latent space with a suitable metric to enforce shortest paths on the manifold of generated samples. These are often, however, susceptible of veering away from the manifold of real samples, resulting in smooth but unrealistic generation that requires an additional method to assess the sample quality along paths. Generative Adversarial Networks (GANs), by construction, measure the sample quality using its discriminator network. In this paper, we establish that the discriminator can be used effectively to avoid regions of low sample quality along shortest paths. By reusing the discriminator network to modify the metric on the latent space, we propose a lightweight solution for improved interpolations in pre-trained GANs.  ( 2 min )
    Boosted Ensemble Learning based on Randomized NNs for Time Series Forecasting. (arXiv:2203.00980v1 [cs.LG])
    Time series forecasting is a challenging problem particularly when a time series expresses multiple seasonality, nonlinear trend and varying variance. In this work, to forecast complex time series, we propose ensemble learning which is based on randomized neural networks, and boosted in three ways. These comprise ensemble learning based on residuals, corrected targets and opposed response. The latter two methods are employed to ensure similar forecasting tasks are solved by all ensemble members, which justifies the use of exactly the same base models at all stages of ensembling. Unification of the tasks for all members simplifies ensemble learning and leads to increased forecasting accuracy. This was confirmed in an experimental study involving forecasting time series with triple seasonality, in which we compare our three variants of ensemble boosting. The strong points of the proposed ensembles based on RandNNs are extremely rapid training and pattern-based time series representation, which extracts relevant information from time series.  ( 2 min )
    TableFormer: Table Structure Understanding with Transformers. (arXiv:2203.01017v1 [cs.CV])
    Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graph's, etc, since they enhance their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, different variety of separation lines, missing entries, etc. As such, the correct identification of the table-structure from an image is a non-trivial task. In this paper, we present a new table-structure identification model. The latter improves the latest end-to-end deep learning model (i.e. encoder-dual-decoder from PubTabNet) in two significant ways. First, we introduce a new object detection decoder for table-cells. In this way, we can obtain the content of the table-cells from programmatic PDF's directly from the PDF source and avoid the training of the custom OCR decoders. This architectural change leads to more accurate table-content extraction and allows us to tackle non-english tables. Second, we replace the LSTM decoders with transformer based decoders. This upgrade improves significantly the previous state-of-the-art tree-editing-distance-score (TEDS) from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables.  ( 2 min )
    Defining a synthetic data generator for realistic electric vehicle charging sessions. (arXiv:2203.01129v1 [cs.OH])
    Electric vehicle (EV) charging stations have become prominent in electricity grids in the past years. Analysis of EV charging sessions is useful for flexibility analysis, load balancing, offering incentives to customers, etc. Yet, the limited availability of such EV sessions data hinders further development in these fields. Addressing this need for publicly available and realistic data, we develop a synthetic data generator (SDG) for EV charging sessions. Our SDG assumes the EV inter-arrival time to follow an exponential distribution. Departure times are modeled by defining a conditional probability density function (pdf) for connection times. This pdf for connection time and required energy is fitted by Gaussian mixture models. Since we train our SDG using a large real-world dataset, its output is realistic.  ( 2 min )
    A Learning Based Framework for Handling Uncertain Lead Times in Multi-Product Inventory Management. (arXiv:2203.00885v1 [cs.LG])
    Most existing literature on supply chain and inventory management consider stochastic demand processes with zero or constant lead times. While it is true that in certain niche scenarios, uncertainty in lead times can be ignored, most real-world scenarios exhibit stochasticity in lead times. These random fluctuations can be caused due to uncertainty in arrival of raw materials at the manufacturer's end, delay in transportation, an unforeseen surge in demands, and switching to a different vendor, to name a few. Stochasticity in lead times is known to severely degrade the performance in an inventory management system, and it is only fair to abridge this gap in supply chain system through a principled approach. Motivated by the recently introduced delay-resolved deep Q-learning (DRDQN) algorithm, this paper develops a reinforcement learning based paradigm for handling uncertainty in lead times (\emph{action delay}). Through empirical evaluations, it is further shown that the inventory management with uncertain lead times is not only equivalent to that of delay in information sharing across multiple echelons (\emph{observation delay}), a model trained to handle one kind of delay is capable to handle delays of another kind without requiring to be retrained. Finally, we apply the delay-resolved framework to scenarios comprising of multiple products subjected to stochasticity in lead times, and elucidate how the delay-resolved framework negates the effect of any delay to achieve near-optimal performance.  ( 2 min )
    E-LANG: Energy-Based Joint Inferencing of Super and Swift Language Models. (arXiv:2203.00748v1 [cs.CL])
    Building huge and highly capable language models has been a trend in the past years. Despite their great performance, they incur high computational cost. A common solution is to apply model compression or choose light-weight architectures, which often need a separate fixed-size model for each desirable computational budget, and may lose performance in case of heavy compression. This paper proposes an effective dynamic inference approach, called E-LANG, which distributes the inference between large accurate Super-models and light-weight Swift models. To this end, a decision making module routes the inputs to Super or Swift models based on the energy characteristics of the representations in the latent space. This method is easily adoptable and architecture agnostic. As such, it can be applied to black-box pre-trained models without a need for architectural manipulations, reassembling of modules, or re-training. Unlike existing methods that are only applicable to encoder-only backbones and classification tasks, our method also works for encoder-decoder structures and sequence-to-sequence tasks such as translation. The E-LANG performance is verified through a set of experiments with T5 and BERT backbones on GLUE, SuperGLUE, and WMT. In particular, we outperform T5-11B with an average computations speed-up of 3.3$\times$ on GLUE and 2.9$\times$ on SuperGLUE. We also achieve BERT-based SOTA on GLUE with 3.2$\times$ less computations. Code and demo are available in the supplementary materials.  ( 2 min )
    Chained Generalisation Bounds. (arXiv:2203.00977v1 [stat.ML])
    This work discusses how to derive upper bounds for the expected generalisation error of supervised learning algorithms by means of the chaining technique. By developing a general theoretical framework, we establish a duality between generalisation bounds based on the regularity of the loss function, and their chained counterparts, which can be obtained by lifting the regularity assumption from the loss onto its gradient. This allows us to re-derive the chaining mutual information bound from the literature, and to obtain novel chained information-theoretic generalisation bounds, based on the Wasserstein distance and other probability metrics. We show on some toy examples that the chained generalisation bound can be significantly tighter than its standard counterpart, particularly when the distribution of the hypotheses selected by the algorithm is very concentrated. Keywords: Generalisation bounds; Chaining; Information-theoretic bounds; Mutual information; Wasserstein distance; PAC-Bayes.  ( 2 min )
    Sampling Random Group Fair Rankings. (arXiv:2203.00887v1 [cs.LG])
    In this paper, we consider the problem of randomized group fair ranking that merges given ranked list of items from different sensitive demographic groups while satisfying given lower and upper bounds on the representation of each group in the top ranks. Our randomized group fair ranking formulation works even when there is implicit bias, incomplete relevance information, or when only ordinal ranking is available instead of relevance scores or utility values. We take an axiomatic approach and show that there is a unique distribution $\mathcal{D}$ to sample a random group fair ranking that satisfies a natural set of consistency and fairness axioms. Moreover, $\mathcal{D}$ satisfies representation constraints for every group at every rank, a characteristic that cannot be satisfied by any deterministic ranking. We propose three algorithms to sample a random group fair ranking from $\mathcal{D}$. Our first algorithm samples rankings from $\mathcal{D}$ exactly, in time exponential in the number of groups. Our second algorithm samples random group fair rankings from $\mathcal{D}$ exactly and is faster than the first algorithm when the gap between upper and lower bounds on the representation for each group is small. Our third algorithm samples rankings from a distribution $\epsilon$-close to $\mathcal{D}$ in total variation distance, and has expected running time polynomial in all input parameters and $1/\epsilon$ when there is a large gap between upper and lower bound representation constraints for all the groups. We experimentally validate the above guarantees of our algorithms for group fairness in top ranks and representation in every rank on real-world data sets.
    Canonical foliations of neural networks: application to robustness. (arXiv:2203.00922v1 [stat.ML])
    Adversarial attack is an emerging threat to the trustability of machine learning. Understanding these attacks is becoming a crucial task. We propose a new vision on neural network robustness using Riemannian geometry and foliation theory, and create a new adversarial attack by taking into account the curvature of the data space. This new adversarial attack called the "dog-leg attack" is a two-step approximation of a geodesic in the data space. The data space is treated as a (pseudo) Riemannian manifold equipped with the pullback of the Fisher Information Metric (FIM) of the neural network. In most cases, this metric is only semi-definite and its kernel becomes a central object to study. A canonical foliation is derived from this kernel. The curvature of the foliation's leaves gives the appropriate correction to get a two-step approximation of the geodesic and hence a new efficient adversarial attack. Our attack is tested on a toy example, a neural network trained to mimic the $\texttt{Xor}$ function, and demonstrates better results that the state of the art attack presented by Zhao et al. (2019).
    Personalized Federated Learning With Structure. (arXiv:2203.00829v1 [cs.LG])
    Knowledge sharing and model personalization are two key components to impact the performance of personalized federated learning (PFL). Existing PFL methods simply treat knowledge sharing as an aggregation of all clients regardless of the hidden relations among them. This paper is to enhance the knowledge-sharing process in PFL by leveraging the structural information among clients. We propose a novel structured federated learning(SFL) framework to simultaneously learn the global model and personalized model using each client's local relations with others and its private dataset. This proposed framework has been formulated to a new optimization problem to model the complex relationship among personalized models and structural topology information into a unified framework. Moreover, in contrast to a pre-defined structure, our framework could be further enhanced by adding a structure learning component to automatically learn the structure using the similarities between clients' models' parameters. By conducting extensive experiments, we first demonstrate how federated learning can be benefited by introducing structural information into the server aggregation process with a real-world dataset, and then the effectiveness of the proposed method has been demonstrated in varying degrees of data non-iid settings.
    The quantum low-rank approximation problem. (arXiv:2203.00811v1 [quant-ph])
    We consider a quantum version of the famous low-rank approximation problem. Specifically, we consider the distance $D(\rho,\sigma)$ between two normalized quantum states, $\rho$ and $\sigma$, where the rank of $\sigma$ is constrained to be at most $R$. For both the trace distance and Hilbert-Schmidt distance, we analytically solve for the optimal state $\sigma$ that minimizes this distance. For the Hilbert-Schmidt distance, the unique optimal state is $\sigma = \tau_R +N_R$, where $\tau_R = \Pi_R \rho \Pi_R$ is given by projecting $\rho$ onto its $R$ principal components with projector $\Pi_R$, and $N_R$ is a normalization factor given by $N_R = \frac{1- \text{Tr}(\tau_R)}{R}\Pi_R$. For the trace distance, this state is also optimal but not uniquely optimal, and we provide the full set of states that are optimal. We briefly discuss how our results have application for performing principal component analysis (PCA) via variational optimization on quantum computers.
    Transfer Learning of High-Fidelity Opacity Spectra in Autoencoders and Surrogate Models. (arXiv:2203.00853v1 [physics.plasm-ph])
    Simulations of high energy density physics are expensive, largely in part for the need to produce non-local thermodynamic equilibrium opacities. High-fidelity spectra may reveal new physics in the simulations not seen with low-fidelity spectra, but the cost of these simulations also scale with the level of fidelity of the opacities being used. Neural networks are capable of reproducing these spectra, but neural networks need data to to train them which limits the level of fidelity of the training data. This paper demonstrates that it is possible to reproduce high-fidelity spectra with median errors in the realm of 3\% to 4\% using as few as 50 samples of high-fidelity Krypton data by performing transfer learning on a neural network trained on many times more low-fidelity data.
    Neuro-Symbolic Verification of Deep Neural Networks. (arXiv:2203.00938v1 [cs.AI])
    Formal verification has emerged as a powerful approach to ensure the safety and reliability of deep neural networks. However, current verification tools are limited to only a handful of properties that can be expressed as first-order constraints over the inputs and output of a network. While adversarial robustness and fairness fall under this category, many real-world properties (e.g., "an autonomous vehicle has to stop in front of a stop sign") remain outside the scope of existing verification technology. To mitigate this severe practical restriction, we introduce a novel framework for verifying neural networks, named neuro-symbolic verification. The key idea is to use neural networks as part of the otherwise logical specification, enabling the verification of a wide variety of complex, real-world properties, including the one above. Moreover, we demonstrate how neuro-symbolic verification can be implemented on top of existing verification infrastructure for neural networks, making our framework easily accessible to researchers and practitioners alike.  ( 2 min )
    FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours. (arXiv:2203.00854v1 [cs.LG])
    Protein structure prediction is an important method for understanding gene translation and protein function in the domain of structural biology. AlphaFold introduced the Transformer model to the field of protein structure prediction with atomic accuracy. However, training and inference of the AlphaFold model are time-consuming and expensive because of the special performance characteristics and huge memory consumption. In this paper, we propose FastFold, a highly efficient implementation of protein structure prediction model for training and inference. FastFold includes a series of GPU optimizations based on a thorough analysis of AlphaFold's performance. Meanwhile, with \textit{Dynamic Axial Parallelism} and \textit{Duality Async Operation}, FastFold achieves high model parallelism scaling efficiency, surpassing existing popular model parallelism techniques. Experimental results show that FastFold reduces overall training time from 11 days to 67 hours and achieves $7.5\sim9.5\times$ speedup for long-sequence inference. Furthermore, We scaled FastFold to 512 GPUs and achieved an aggregate of 6.02 PetaFLOPs with 90.1\% parallel efficiency. The implementation can be found at https://github.com/hpcaitech/FastFold.  ( 2 min )
    Tricks and Plugins to GBM on Images and Sequences. (arXiv:2203.00761v1 [cs.LG])
    Convolutional neural networks (CNNs) and transformers, which are composed of multiple processing layers and blocks to learn the representations of data with multiple abstract levels, are the most successful machine learning models in recent years. However, millions of parameters and many blocks make them difficult to be trained, and sometimes several days or weeks are required to find an ideal architecture or tune the parameters. Within this paper, we propose a new algorithm for boosting Deep Convolutional Neural Networks (BoostCNN) to combine the merits of dynamic feature selection and BoostCNN, and another new family of algorithms combining boosting and transformers. To learn these new models, we introduce subgrid selection and importance sampling strategies and propose a set of algorithms to incorporate boosting weights into a deep learning architecture based on a least squares objective function. These algorithms not only reduce the required manual effort for finding an appropriate network architecture but also result in superior performance and lower running time. Experiments show that the proposed methods outperform benchmarks on several fine-grained classification tasks.  ( 2 min )
    The Optimal Noise in Noise-Contrastive Learning Is Not What You Think. (arXiv:2203.01110v1 [stat.ML])
    Learning a parametric model of a data distribution is a well-known statistical problem that has seen renewed interest as it is brought to scale in deep learning. Framing the problem as a self-supervised task, where data samples are discriminated from noise samples, is at the core of state-of-the-art methods, beginning with Noise-Contrastive Estimation (NCE). Yet, such contrastive learning requires a good noise distribution, which is hard to specify; domain-specific heuristics are therefore widely used. While a comprehensive theory is missing, it is widely assumed that the optimal noise should in practice be made equal to the data, both in distribution and proportion. This setting underlies Generative Adversarial Networks (GANs) in particular. Here, we empirically and theoretically challenge this assumption on the optimal noise. We show that deviating from this assumption can actually lead to better statistical estimators, in terms of asymptotic variance. In particular, the optimal noise distribution is different from the data's and even from a different family.  ( 2 min )
    Follow your Nose: Using General Value Functions for Directed Exploration in Reinforcement Learning. (arXiv:2203.00874v1 [cs.LG])
    Exploration versus exploitation dilemma is a significant problem in reinforcement learning (RL), particularly in complex environments with large state space and sparse rewards. When optimizing for a particular goal, running simple smaller tasks can often be a good way to learn additional information about the environment. Exploration methods have been used to sample better trajectories from the environment for improved performance while auxiliary tasks have been incorporated generally where the reward is sparse. If there is little reward signal available, the agent requires clever exploration strategies to reach parts of the state space that contain relevant sub-goals. However, that exploration needs to be balanced with the need for exploiting the learned policy. This paper explores the idea of combining exploration with auxiliary task learning using General Value Functions (GVFs) and a directed exploration strategy. We provide a simple way to learn options (sequences of actions) instead of having to handcraft them, and demonstrate the performance advantage in three navigation tasks.  ( 2 min )
    Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models. (arXiv:2203.01104v1 [cs.CL])
    The state-of-the-art Mixture-of-Experts (short as MoE) architecture has achieved several remarkable successes in terms of increasing model capacity. However, MoE has been hindered widespread adoption due to complexity, communication costs, and training instability. Here we present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics. It can decompose an original matrix into central tensors (containing the core information) and auxiliary tensors (with only a small proportion of parameters). With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture by sharing a global central tensor across experts and keeping expert-specific auxiliary tensors. We also design the gradient mask strategy for the tensor structure of MPO to alleviate the overfitting problem. Experiments on the three well-known downstream natural language datasets based on GPT2 show improved performance and efficiency in increasing model capacity (7.26x fewer parameters with the same amount of experts). We additionally demonstrate an improvement in the positive transfer effects of our approach for multi-task learning.  ( 2 min )
    Learning in Sparse Rewards settings through Quality-Diversity algorithms. (arXiv:2203.01027v1 [cs.LG])
    In the Reinforcement Learning (RL) framework, the learning is guided through a reward signal. This means that in situations of sparse rewards the agent has to focus on exploration, in order to discover which action, or set of actions leads to the reward. RL agents usually struggle with this. Exploration is the focus of Quality-Diversity (QD) methods. In this thesis, we approach the problem of sparse rewards with these algorithms, and in particular with Novelty Search (NS). This is a method that only focuses on the diversity of the possible policies behaviors. The first part of the thesis focuses on learning a representation of the space in which the diversity of the policies is evaluated. In this regard, we propose the TAXONS algorithm, a method that learns a low-dimensional representation of the search space through an AutoEncoder. While effective, TAXONS still requires information on when to capture the observation used to learn said space. For this, we study multiple ways, and in particular the signature transform, to encode information about the whole trajectory of observations. The thesis continues with the introduction of the SERENE algorithm, a method that can efficiently focus on the interesting parts of the search space. This method separates the exploration of the search space from the exploitation of the reward through a two-alternating-steps approach. The exploration is performed through NS. Any discovered reward is then locally exploited through emitters. The third and final contribution combines TAXONS and SERENE into a single approach: STAX. Throughout this thesis, we introduce methods that lower the amount of prior information needed in sparse rewards settings. These contributions are a promising step towards the development of methods that can autonomously explore and find high-performance policies in a variety of sparse rewards settings.  ( 2 min )
    L4KDE: Learning for KinoDynamic Tree Expansion. (arXiv:2203.00975v1 [cs.RO])
    We present the Learning for KinoDynamic Tree Expansion (L4KDE) method for kinodynamic planning. Tree-based planning approaches, such as rapidly exploring random tree (RRT), are the dominant approach to finding globally optimal plans in continuous state-space motion planning. Central to these approaches is tree-expansion, the procedure in which new nodes are added into an ever-expanding tree. We study the kinodynamic variants of tree-based planning, where we have known system dynamics and kinematic constraints. In the interest of quickly selecting nodes to connect newly sampled coordinates, existing methods typically cannot optimise to find nodes which have low cost to transition to sampled coordinates. Instead they use metrics like Euclidean distance between coordinates as a heuristic for selecting candidate nodes to connect to the search tree. We propose L4KDE to address this issue. L4KDE uses a neural network to predict transition costs between queried states, which can be efficiently computed in batch, providing much higher quality estimates of transition cost compared to commonly used heuristics while maintaining almost-surely asymptotic optimality guarantee. We empirically demonstrate the significant performance improvement provided by L4KDE on a variety of challenging system dynamics, with the ability to generalise across different instances of the same model class, and in conjunction with a suite of modern tree-based motion planners.  ( 2 min )
    SelfKG: Self-Supervised Entity Alignment in Knowledge Graphs. (arXiv:2203.01044v1 [cs.LG])
    Entity alignment, aiming to identify equivalent entities across different knowledge graphs (KGs), is a fundamental problem for constructing Web-scale KGs. Over the course of its development, the label supervision has been considered necessary for accurate alignments. Inspired by the recent progress of self-supervised learning, we explore the extent to which we can get rid of supervision for entity alignment. Commonly, the label information (positive entity pairs) is used to supervise the process of pulling the aligned entities in each positive pair closer. However, our theoretical analysis suggests that the learning of entity alignment can actually benefit more from pushing unlabeled negative pairs far away from each other than pulling labeled positive pairs close. By leveraging this discovery, we develop the self-supervised learning objective for entity alignment. We present SelfKG with efficient strategies to optimize this objective for aligning entities without label supervision. Extensive experiments on benchmark datasets demonstrate that SelfKG without supervision can match or achieve comparable results with state-of-the-art supervised baselines. The performance of SelfKG suggests that self-supervised learning offers great potential for entity alignment in KGs. The code and data are available at https://github.com/THUDM/SelfKG.  ( 2 min )
    ES-dRNN with Dynamic Attention for Short-Term Load Forecasting. (arXiv:2203.00937v1 [cs.LG])
    Short-term load forecasting (STLF) is a challenging problem due to the complex nature of the time series expressing multiple seasonality and varying variance. This paper proposes an extension of a hybrid forecasting model combining exponential smoothing and dilated recurrent neural network (ES-dRNN) with a mechanism for dynamic attention. We propose a new gated recurrent cell -- attentive dilated recurrent cell, which implements an attention mechanism for dynamic weighting of input vector components. The most relevant components are assigned greater weights, which are subsequently dynamically fine-tuned. This attention mechanism helps the model to select input information and, along with other mechanisms implemented in ES-dRNN, such as adaptive time series processing, cross-learning, and multiple dilation, leads to a significant improvement in accuracy when compared to well-established statistical and state-of-the-art machine learning forecasting models. This was confirmed in the extensive experimental study concerning STLF for 35 European countries.  ( 2 min )
    Model-agnostic out-of-distribution detection using combined statistical tests. (arXiv:2203.01097v1 [stat.ML])
    We present simple methods for out-of-distribution detection using a trained generative model. These techniques, based on classical statistical tests, are model-agnostic in the sense that they can be applied to any differentiable generative model. The idea is to combine a classical parametric test (Rao's score test) with the recently introduced typicality test. These two test statistics are both theoretically well-founded and exploit different sources of information based on the likelihood for the typicality test and its gradient for the score test. We show that combining them using Fisher's method overall leads to a more accurate out-of-distribution test. We also discuss the benefits of casting out-of-distribution detection as a statistical testing problem, noting in particular that false positive rate control can be valuable for practical out-of-distribution detection. Despite their simplicity and generality, these methods can be competitive with model-specific out-of-distribution detection algorithms without any assumptions on the out-distribution.  ( 2 min )
    Continual Feature Selection: Spurious Features in Continual Learning. (arXiv:2203.01012v1 [cs.LG])
    Continual Learning (CL) is the research field addressing learning settings where the data distribution is not static. This paper studies spurious features' influence on continual learning algorithms. Indeed, we show that learning algorithms solve tasks by overfitting features that are not generalizable. To better understand these phenomena and their impact, we propose a domain incremental scenario that we study through various out-of-distribution generalizations and continual learning algorithms. The experiments of this paper show that continual learning algorithms face two related challenges: (1) the spurious features challenge: some features are well correlated with labels in train data but not in test data due to a covariate shift between train and test. (2) the local spurious features challenge: some features correlate well with labels within a task but not within the whole task sequence. The challenge is to learn general features that are neither spurious (in general) nor locally spurious. We prove that the latter is a major cause of performance decrease in continual learning along with catastrophic forgetting. Our results indicate that the best solution to overcome the feature selection problems varies depending on the correlation between spurious features (SFs) and labels. The vanilla replay approach seems to be a powerful approach to deal with SFs, which could explain its good performance in the continual learning literature. This paper presents a different way of understanding performance decrease in continual learning by describing the influence of spurious/local spurious features.  ( 2 min )
    A density peaks clustering algorithm with sparse search and K-d tree. (arXiv:2203.00973v1 [stat.ML])
    Density peaks clustering has become a nova of clustering algorithm because of its simplicity and practicality. However, there is one main drawback: it is time-consuming due to its high computational complexity. Herein, a density peaks clustering algorithm with sparse search and K-d tree is developed to solve this problem. Firstly, a sparse distance matrix is calculated by using K-d tree to replace the original full rank distance matrix, so as to accelerate the calculation of local density. Secondly, a sparse search strategy is proposed to accelerate the computation of relative-separation with the intersection between the set of k nearest neighbors and the set consisting of the data points with larger local density for any data point. Furthermore, a second-order difference method for decision values is adopted to determine the cluster centers adaptively. Finally, experiments are carried out on datasets with different distribution characteristics, by comparing with other five typical clustering algorithms. It is proved that the algorithm can effectively reduce the computational complexity. Especially for larger datasets, the efficiency is elevated more remarkably. Moreover, the clustering accuracy is also improved to a certain extent. Therefore, it can be concluded that the overall performance of the newly proposed algorithm is excellent.  ( 2 min )
    Topic Analysis for Text with Side Data. (arXiv:2203.00762v1 [cs.LG])
    Although latent factor models (e.g., matrix factorization) obtain good performance in predictions, they suffer from several problems including cold-start, non-transparency, and suboptimal recommendations. In this paper, we employ text with side data to tackle these limitations. We introduce a hybrid generative probabilistic model that combines a neural network with a latent topic model, which is a four-level hierarchical Bayesian model. In the model, each document is modeled as a finite mixture over an underlying set of topics and each topic is modeled as an infinite mixture over an underlying set of topic probabilities. Furthermore, each topic probability is modeled as a finite mixture over side data. In the context of text, the neural network provides an overview distribution about side data for the corresponding text, which is the prior distribution in LDA to help perform topic grouping. The approach is evaluated on several different datasets, where the model is shown to outperform standard LDA and Dirichlet-multinomial regression (DMR) in terms of topic grouping, model perplexity, classification and comment generation.  ( 2 min )
    Pattern Recognition and Event Detection on IoT Data-streams. (arXiv:2203.01114v1 [cs.LG])
    Big data streams are possibly one of the most essential underlying notions. However, data streams are often challenging to handle owing to their rapid pace and limited information lifetime. It is difficult to collect and communicate stream samples while storing, transmitting and computing a function across the whole stream or even a large segment of it. In answer to this research issue, many streaming-specific solutions were developed. Stream techniques imply a limited capacity of one or more resources such as computing power and memory, as well as time or accuracy limits. Reservoir sampling algorithms choose and store results that are probabilistically significant. A weighted random sampling approach using a generalised sampling algorithmic framework to detect unique events is the key research goal of this work. Briefly, a gradually developed estimate of the joint stream distribution across all feasible components keeps k stream elements judged representative for the full stream. Once estimate confidence is high, k samples are chosen evenly. The complexity is O(min(k,n-k)), where n is the number of items inspected. Due to the fact that events are usually considered outliers, it is sufficient to extract element patterns and push them to an alternate version of k-means as proposed here. The suggested technique calculates the sum of squared errors (SSE) for each cluster, and this is utilised not only as a measure of convergence, but also as a quantification and an indirect assessment of the element distribution's approximation accuracy. This clustering enables for the detection of outliers in the stream based on their distance from the usual event centroids. The findings reveal that weighted sampling and res-means outperform typical approaches for stream event identification. Detected events are shown as knowledge graphs, along with typical clusters of events.  ( 2 min )
    VaiPhy: a Variational Inference Based Algorithm for Phylogeny. (arXiv:2203.01121v1 [q-bio.PE])
    Phylogenetics is a classical methodology in computational biology that today has become highly relevant for medical investigation of single-cell data, e.g., in the context of development of cancer. The exponential size of the tree space is unfortunately a formidable obstacle for current Bayesian phylogenetic inference using Markov chain Monte Carlo based methods since these rely on local operations. And although more recent variational inference (VI) based methods offer speed improvements, they rely on expensive auto-differentiation operations for learning the variational parameters. We propose VaiPhy, a remarkably fast VI based algorithm for approximate posterior inference in an augmented tree space. VaiPhy produces marginal log-likelihood estimates on par with the state-of-the-art methods on real data, and is considerably faster since it does not require auto-differentiation. Instead, VaiPhy combines coordinate ascent update equations with two novel sampling schemes: (i) SLANTIS, a proposal distribution for tree topologies in the augmented tree space, and (ii) the JC sampler, the, to the best of our knowledge, first ever scheme for sampling branch lengths directly from the popular Jukes-Cantor model. We compare VaiPhy in terms of density estimation and runtime. Additionally, we evaluate the reproducibility of the baselines. We provide our code on GitHub: https://github.com/Lagergren-Lab/VaiPhy.  ( 2 min )
    The Theoretical Expressiveness of Maxpooling. (arXiv:2203.01016v1 [cs.LG])
    Over the decade since deep neural networks became state of the art image classifiers there has been a tendency towards less use of max pooling: the function that takes the largest of nearby pixels in an image. Since max pooling featured prominently in earlier generations of image classifiers, we wish to understand this trend, and whether it is justified. We develop a theoretical framework analyzing ReLU based approximations to max pooling, and prove a sense in which max pooling cannot be efficiently replicated using ReLU activations. We analyze the error of a class of optimal approximations, and find that whilst the error can be made exponentially small in the kernel size, doing so requires an exponentially complex approximation. Our work gives a theoretical basis for understanding the trend away from max pooling in newer architectures. We conclude that the main cause of a difference between max pooling and an optimal approximation, a prevalent large difference between the max and other values within pools, can be overcome with other architectural decisions, or is not prevalent in natural images.  ( 2 min )
    MIAShield: Defending Membership Inference Attacks via Preemptive Exclusion of Members. (arXiv:2203.00915v1 [cs.CR])
    In membership inference attacks (MIAs), an adversary observes the predictions of a model to determine whether a sample is part of the model's training data. Existing MIA defenses conceal the presence of a target sample through strong regularization, knowledge distillation, confidence masking, or differential privacy. We propose MIAShield, a new MIA defense based on preemptive exclusion of member samples instead of masking the presence of a member. The key insight in MIAShield is weakening the strong membership signal that stems from the presence of a target sample by preemptively excluding it at prediction time without compromising model utility. To that end, we design and evaluate a suite of preemptive exclusion oracles leveraging model-confidence, exact or approximate sample signature, and learning-based exclusion of member data points. To be practical, MIAShield splits a training data into disjoint subsets and trains each subset to build an ensemble of models. The disjointedness of subsets ensures that a target sample belongs to only one subset, which isolates the sample to facilitate the preemptive exclusion goal. We evaluate MIAShield on three benchmark image classification datasets. We show that MIAShield effectively mitigates membership inference (near random guess) for a wide range of MIAs, achieves far better privacy-utility trade-off compared with state-of-the-art defenses, and remains resilient against an adaptive adversary.  ( 2 min )
    GSC Loss: A Gaussian Score Calibrating Loss for Deep Learning. (arXiv:2203.00833v1 [cs.LG])
    Cross entropy (CE) loss integrated with softmax is an orthodox component in most classification-based frameworks, but it fails to obtain an accurate probability distribution of predicted scores that is critical for further decision-making of poor-classified samples. The prediction score calibration provides a solution to learn the distribution of predicted scores which can explicitly make the model obtain a discriminative representation. Considering the entropy function can be utilized to measure the uncertainty of predicted scores. But, the gradient variation of it is not in line with the expectations of model optimization. To this end, we proposed a general Gaussian Score Calibrating (GSC) loss to calibrate the predicted scores produced by the deep neural networks (DNN). Extensive experiments on over 10 benchmark datasets demonstrate that the proposed GSC loss can yield consistent and significant performance boosts in a variety of visual tasks. Notably, our label-independent GSC loss can be embedded into common improved methods based on the CE loss easily.  ( 2 min )
    A Constrained Optimization Approach to Bilevel Optimization with Multiple Inner Minima. (arXiv:2203.01123v1 [math.OC])
    Bilevel optimization has found extensive applications in modern machine learning problems such as hyperparameter optimization, neural architecture search, meta-learning, etc. While bilevel problems with a unique inner minimal point (e.g., where the inner function is strongly convex) are well understood, bilevel problems with multiple inner minimal points remains to be a challenging and open problem. Existing algorithms designed for such a problem were applicable to restricted situations and do not come with the full guarantee of convergence. In this paper, we propose a new approach, which convert the bilevel problem to an equivalent constrained optimization, and then the primal-dual algorithm can be used to solve the problem. Such an approach enjoys a few advantages including (a) addresses the multiple inner minima challenge; (b) features fully first-order efficiency without involving second-order Hessian and Jacobian computations, as opposed to most existing gradient-based bilevel algorithms; (c) admits the convergence guarantee via constrained nonconvex optimization. Our experiments further demonstrate the desired performance of the proposed approach.
    Learning Robust Real-Time Cultural Transmission without Human Data. (arXiv:2203.00715v1 [cs.LG])
    Cultural transmission is the domain-general social skill that allows agents to acquire and use information from each other in real-time with high fidelity and recall. In humans, it is the inheritance process that powers cumulative cultural evolution, expanding our skills, tools and knowledge across generations. We provide a method for generating zero-shot, high recall cultural transmission in artificially intelligent agents. Our agents succeed at real-time cultural transmission from humans in novel contexts without using any pre-collected human data. We identify a surprisingly simple set of ingredients sufficient for generating cultural transmission and develop an evaluation methodology for rigorously assessing it. This paves the way for cultural evolution as an algorithm for developing artificial general intelligence.
    CD-GAN: a robust fusion-based generative adversarial network for unsupervised change detection between heterogeneous images. (arXiv:2203.00948v1 [eess.IV])
    In the context of Earth observation, the detection of changes is performed from multitemporal images acquired by sensors with possibly different characteristics and modalities. Even when restricting to the optical modality, this task has proved to be challenging as soon as the sensors provide images of different spatial and/or spectral resolutions. This paper proposes a novel unsupervised change detection method dedicated to such so-called heterogeneous optical images. This method capitalizes on recent advances which frame the change detection problem into a robust fusion framework. More precisely, we show that a deep adversarial network designed and trained beforehand to fuse a pair of multiband images can be easily complemented by a network with the same architecture to perform change detection. The resulting overall architecture itself follows an adversarial strategy where the fusion network and the additional network are interpreted as essential building blocks of a generator. A comparison with state-of-the-art change detection methods demonstrate the versatility and the effectiveness of the proposed approach.
    Predicting the Thermal Sunyaev-Zel'dovich Field using Modular and Equivariant Set-Based Neural Networks. (arXiv:2203.00026v1 [astro-ph.CO] CROSS LISTED)
    Theoretical uncertainty limits our ability to extract cosmological information from baryonic fields such as the thermal Sunyaev-Zel'dovich (tSZ) effect. Being sourced by the electron pressure field, the tSZ effect depends on baryonic physics that is usually modeled by expensive hydrodynamic simulations. We train neural networks on the IllustrisTNG-300 cosmological simulation to predict the continuous electron pressure field in galaxy clusters from gravity-only simulations. Modeling clusters is challenging for neural networks as most of the gas pressure is concentrated in a handful of voxels and even the largest hydrodynamical simulations contain only a few hundred clusters that can be used for training. Instead of conventional convolutional neural net (CNN) architectures, we choose to employ a rotationally equivariant DeepSets architecture to operate directly on the set of dark matter particles. We argue that set-based architectures provide distinct advantages over CNNs. For example, we can enforce exact rotational and permutation equivariance, incorporate existing knowledge on the tSZ field, and work with sparse fields as are standard in cosmology. We compose our architecture with separate, physically meaningful modules, making it amenable to interpretation. For example, we can separately study the influence of local and cluster-scale environment, determine that cluster triaxiality has negligible impact, and train a module that corrects for mis-centering. Our model improves by 70 % on analytic profiles fit to the same simulation data. We argue that the electron pressure field, viewed as a function of a gravity-only simulation, has inherent stochasticity, and model this property through a conditional-VAE extension to the network. This modification yields further improvement by 7 %, it is limited by our small training set however. (abridged)  ( 3 min )
    Adversarially Robust Learning with Tolerance. (arXiv:2203.00849v1 [stat.ML])
    We study the problem of tolerant adversarial PAC learning with respect to metric perturbation sets. In adversarial PAC learning, an adversary is allowed to replace a test point $x$ with an arbitrary point in a closed ball of radius $r$ centered at $x$. In the tolerant version, the error of the learner is compared with the best achievable error with respect to a slightly larger perturbation radius $(1+\gamma)r$. For perturbation sets with doubling dimension $d$, we show that a variant of the natural ``perturb-and-smooth'' algorithm PAC learns any hypothesis class $\mathcal{H}$ with VC dimension $v$ in the $\gamma$-tolerant adversarial setting with $O\left(\frac{v(1+1/\gamma)^{O(d)}}{\varepsilon}\right)$ samples. This is the first such general guarantee with linear dependence on $v$ even for the special case where the domain is the real line and the perturbation sets are closed balls (intervals) of radius $r$. However, the proposed guarantees for the perturb-and-smooth algorithm currently only hold in the tolerant robust realizable setting and exhibit exponential dependence on $d$. We additionally propose an alternative learning method which yields sample complexity bounds with only linear dependence on the doubling dimension even in the more general agnostic case. This approach is based on sample compression.
    Weisfeiler and Leman Go Infinite: Spectral and Combinatorial Pre-Colorings. (arXiv:2201.13410v2 [cs.LG] UPDATED)
    Graph isomorphism testing is usually approached via the comparison of graph invariants. Two popular alternatives that offer a good trade-off between expressive power and computational efficiency are combinatorial (i.e., obtained via the Weisfeiler-Leman (WL) test) and spectral invariants. While the exact power of the latter is still an open question, the former is regularly criticized for its limited power, when a standard configuration of uniform pre-coloring is used. This drawback hinders the applicability of Message Passing Graph Neural Networks (MPGNNs), whose expressive power is upper bounded by the WL test. Relaxing the assumption of uniform pre-coloring, we show that one can increase the expressive power of the WL test ad infinitum. Following that, we propose an efficient pre-coloring based on spectral features that provably increase the expressive power of the vanilla WL test. The above claims are accompanied by extensive synthetic and real data experiments. The code to reproduce our experiments is available at https://github.com/TPFI22/Spectral-and-Combinatorial  ( 2 min )
    Label-Descriptive Patterns and Their Application to Characterizing Classification Errors. (arXiv:2110.09599v2 [cs.LG] UPDATED)
    State-of-the-art deep learning methods achieve human-like performance on many tasks, but make errors nevertheless. Characterizing these errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors, but also gives a way to act and improve the classifier. We propose to discover those feature-value combinations (ie. patterns) that strongly correlate with correct resp. erroneous predictions to obtain a global and interpretable description for arbitrary classifiers. We show this is an instance of the more general label description problem, which we formulate in terms of the Minimum Description Length principle. To discover a good pattern set, we develop the efficient Premise algorithm. Through an extensive set of experiments we show it performs very well in practice on both synthetic and real-world data. Unlike existing solutions, it ably recovers ground truth patterns, even on highly imbalanced data over many features. Through two case studies on Visual Question Answering and Named Entity Recognition, we confirm that Premise gives clear and actionable insight into the systematic errors made by modern NLP classifiers.  ( 2 min )
    Visual Attention Prediction Improves Performance of Autonomous Drone Racing Agents. (arXiv:2201.02569v3 [cs.RO] UPDATED)
    Humans race drones faster than neural networks trained for end-to-end autonomous flight. This may be related to the ability of human pilots to select task-relevant visual information effectively. This work investigates whether neural networks capable of imitating human eye gaze behavior and attention can improve neural network performance for the challenging task of vision-based autonomous drone racing. We hypothesize that gaze-based attention prediction can be an efficient mechanism for visual information selection and decision making in a simulator-based drone racing task. We test this hypothesis using eye gaze and flight trajectory data from 18 human drone pilots to train a visual attention prediction model. We then use this visual attention prediction model to train an end-to-end controller for vision-based autonomous drone racing using imitation learning. We compare the drone racing performance of the attention-prediction controller to those using raw image inputs and image-based abstractions (i.e., feature tracks). Comparing success rates for completing a challenging race track by autonomous flight, our results show that the attention-prediction based controller (88% success rate) outperforms the RGB-image (61% success rate) and feature-tracks (55% success rate) controller baselines. Furthermore, visual attention-prediction and feature-track based models showed better generalization performance than image-based models when evaluated on hold-out reference trajectories. Our results demonstrate that human visual attention prediction improves the performance of autonomous vision-based drone racing agents and provides an essential step towards vision-based, fast, and agile autonomous flight that eventually can reach and even exceed human performances.
    Vector-quantized Image Modeling with Improved VQGAN. (arXiv:2110.04627v2 [cs.CV] UPDATED)
    Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional, class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Fr'echet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L on linear-probe accuracy from 60.3% to 72.2% for a similar model size. ViM-L also outperforms iGPT-XL which is trained with extra web image data and larger model size.
    Efficient Meta Subspace Optimization. (arXiv:2110.14920v2 [math.OC] UPDATED)
    Subspace optimization methods have the attractive property of reducing large-scale optimization problems to a sequence of low-dimensional subspace optimization problems. However, existing subspace optimization frameworks adopt a fixed update policy of the subspace and therefore appear to be sub-optimal. In this paper, we propose a new \emph{Meta Subspace Optimization} (MSO) framework for large-scale optimization problems, which allows to determine the subspace matrix at each optimization iteration. In order to remain invariant to the optimization problem's dimension, we design an \emph{efficient} meta optimizer based on very low-dimensional subspace optimization coefficients, inducing a rule-based method that can significantly improve performance. Finally, we design and analyze a reinforcement learning (RL) procedure based on the subspace optimization dynamics whose learnt policies outperform existing subspace optimization methods.
    Explanation of Machine Learning Models Using Shapley Additive Explanation and Application for Real Data in Hospital. (arXiv:2112.11071v2 [cs.LG] UPDATED)
    When using machine learning techniques in decision-making processes, the interpretability of the models is important. In the present paper, we adopted the Shapley additive explanation (SHAP), which is based on fair profit allocation among many stakeholders depending on their contribution, for interpreting a gradient-boosting decision tree model using hospital data. For better interpretability, we propose two novel techniques as follows: (1) a new metric of feature importance using SHAP and (2) a technique termed feature packing, which packs multiple similar features into one grouped feature to allow an easier understanding of the model without reconstruction of the model. We then compared the explanation results between the SHAP framework and existing methods. In addition, we showed how the A/G ratio works as an important prognostic factor for cerebral infarction using our hospital data and proposed techniques.
    Reliable validation of Reinforcement Learning Benchmarks. (arXiv:2203.01075v1 [cs.LG])
    Reinforcement Learning (RL) is one of the most dynamic research areas in Game AI and AI as a whole, and a wide variety of games are used as its prominent test problems. However, it is subject to the replicability crisis that currently affects most algorithmic AI research. Benchmarking in Reinforcement Learning could be improved through verifiable results. There are numerous benchmark environments whose scores are used to compare different algorithms, such as Atari. Nevertheless, reviewers must trust that figures represent truthful values, as it is difficult to reproduce an exact training curve. We propose improving this situation by providing access to the original experimental data to validate study results. To that end, we rely on the concept of minimal traces. These allow re-simulation of action sequences in deterministic RL environments and, in turn, enable reviewers to verify, re-use, and manually inspect experimental results without needing large compute clusters. It also permits validation of presented reward graphs, an inspection of individual episodes, and re-use of result data (baselines) for proper comparison in follow-up papers. We offer plug-and-play code that works with Gym so that our measures fit well in the existing RL and reproducibility eco-system. Our approach is freely available, easy to use, and adds minimal overhead, as minimal traces allow a data compression ratio of up to $\approx 10^4:1$ (94GB to 8MB for Atari Pong) compared to a regular MDP trace used in offline RL datasets. The paper presents proof-of-concept results for a variety of games.
    Predicting the temporal dynamics of turbulent channels through deep learning. (arXiv:2203.00974v1 [physics.flu-dyn])
    The success of recurrent neural networks (RNNs) has been demonstrated in many applications related to turbulence, including flow control, optimization, turbulent features reproduction as well as turbulence prediction and modeling. With this study we aim to assess the capability of these networks to reproduce the temporal evolution of a minimal turbulent channel flow. We first obtain a data-driven model based on a modal decomposition in the Fourier domain (which we denote as FFT-POD) of the time series sampled from the flow. This particular case of turbulent flow allows us to accurately simulate the most relevant coherent structures close to the wall. Long-short-term-memory (LSTM) networks and a Koopman-based framework (KNF) are trained to predict the temporal dynamics of the minimal-channel-flow modes. Tests with different configurations highlight the limits of the KNF method compared to the LSTM, given the complexity of the flow under study. Long-term prediction for LSTM show excellent agreement from the statistical point of view, with errors below 2% for the best models with respect to the reference. Furthermore, the analysis of the chaotic behaviour through the use of the Lyapunov exponents and of the dynamic behaviour through Poincar\'e maps emphasizes the ability of the LSTM to reproduce the temporal dynamics of turbulence. Alternative reduced-order models (ROMs), based on the identification of different turbulent structures, are explored and they continue to show a good potential in predicting the temporal dynamics of the minimal channel.
    CandidateDrug4Cancer: An Open Molecular Graph Learning Benchmark on Drug Discovery for Cancer. (arXiv:2203.00836v1 [cs.LG])
    Anti-cancer drug discoveries have been serendipitous, we sought to present the Open Molecular Graph Learning Benchmark, named CandidateDrug4Cancer, a challenging and realistic benchmark dataset to facilitate scalable, robust, and reproducible graph machine learning research for anti-cancer drug discovery. CandidateDrug4Cancer dataset encompasses multiple most-mentioned 29 targets for cancer, covering 54869 cancer-related drug molecules which are ranged from pre-clinical, clinical and FDA-approved. Besides building the datasets, we also perform benchmark experiments with effective Drug Target Interaction (DTI) prediction baselines using descriptors and expressive graph neural networks. Experimental results suggest that CandidateDrug4Cancer presents significant challenges for learning molecular graphs and targets in practical application, indicating opportunities for future researches on developing candidate drugs for treating cancers.
    Engineering the Neural Automatic Passenger Counter. (arXiv:2203.01156v1 [cs.LG])
    Automatic passenger counting (APC) in public transportation has been approached with various machine learning and artificial intelligence methods since its introduction in the 1970s. While equivalence testing is becoming more popular than difference detection (Student's t-test), the former is much more difficult to pass to ensure low user risk. On the other hand, recent developments in artificial intelligence have led to algorithms that promise much higher counting quality (lower bias). However, gradient-based methods (including Deep Learning) have one limitation: they typically run into local optima. In this work, we explore and exploit various aspects of machine learning to increase reliability, performance, and counting quality. We perform a grid search with several fundamental parameters: the selection and size of the training set, which is similar to cross-validation, and the initial network weights and randomness during the training process. Using this experiment, we show how aggregation techniques such as ensemble quantiles can reduce bias, and we give an idea of the overall spread of the results. We utilize the test success chance, a simulative metric based on the empirical distribution. We also employ a post-training Monte Carlo quantization approach and introduce cumulative summation to turn counting into a stationary method and allow unbounded counts.
    Learning from Imperfect Demonstrations via Adversarial Confidence Transfer. (arXiv:2202.02967v2 [cs.RO] UPDATED)
    Existing learning from demonstration algorithms usually assume access to expert demonstrations. However, this assumption is limiting in many real-world applications since the collected demonstrations may be suboptimal or even consist of failure cases. We therefore study the problem of learning from imperfect demonstrations by learning a confidence predictor. Specifically, we rely on demonstrations along with their confidence values from a different correspondent environment (source environment) to learn a confidence predictor for the environment we aim to learn a policy in (target environment -- where we only have unlabeled demonstrations.) We learn a common latent space through adversarial distribution matching of multi-length partial trajectories to enable the transfer of confidence across source and target environments. The learned confidence reweights the demonstrations to enable learning more from informative demonstrations and discarding the irrelevant ones. Our experiments in three simulated environments and a real robot reaching task demonstrate that our approach learns a policy with the highest expected return.  ( 2 min )
    Parallel Spatio-Temporal Attention-Based TCN for Multivariate Time Series Prediction. (arXiv:2203.00971v1 [cs.LG])
    As industrial systems become more complex and monitoring sensors for everything from surveillance to our health become more ubiquitous, multivariate time series prediction is taking an important place in the smooth-running of our society. A recurrent neural network with attention to help extend the prediction windows is the current-state-of-the-art for this task. However, we argue that their vanishing gradients, short memories, and serial architecture make RNNs fundamentally unsuited to long-horizon forecasting with complex data. Temporal convolutional networks (TCNs) do not suffer from gradient problems and they support parallel calculations, making them a more appropriate choice. Additionally, they have longer memories than RNNs, albeit with some instability and efficiency problems. Hence, we propose a framework, called PSTA-TCN, that combines a parallel spatio-temporal attention mechanism to extract dynamic internal correlations with stacked TCN backbones to extract features from different window sizes. The framework makes full use parallel calculations to dramatically reduce training times, while substantially increasing accuracy with stable prediction windows up to 13 times longer than the status quo.  ( 2 min )
    Faith-Shap: The Faithful Shapley Shapley Interaction Index. (arXiv:2203.00870v1 [cs.LG])
    Shapley values, which were originally designed to assign attributions to individual players in coalition games, have become a commonly used approach in explainable machine learning to provide attributions to input features for black-box machine learning models. A key attraction of Shapley values is that they uniquely satisfy a very natural set of axiomatic properties. However, extending the Shapley value to assigning attributions to interactions rather than individual players, an interaction index, is non-trivial: as the natural set of axioms for the original Shapley values, extended to the context of interactions, no longer specify a unique interaction index. Many proposals thus introduce additional less "natural" axioms, while sacrificing the key axiom of efficiency, in order to obtain unique interaction indices. In this work, rather than introduce additional conflicting axioms, we adopt the viewpoint of Shapley values as coefficients of the most faithful linear approximation to the pseudo-Boolean coalition game value function. By extending linear to $\ell$-order polynomial approximations, we can then define the general family of faithful interaction indices}. We show that by additionally requiring the faithful interaction indices to satisfy interaction-extensions of the standard individual Shapley axioms (dummy, symmetry, linearity, and efficiency), we obtain a unique FaithfulShapley Interaction index, which we denote Faith-Shap, as a natural generalization of the Shapley value to interactions. We then provide some illustrative contrasts of Faith-Shap with previously proposed interaction indices, and further investigate some of its interesting algebraic properties. We further show the computational efficiency of computing Faith-Shap, together with some additional qualitative insights, via some illustrative experiments.  ( 2 min )
    Geometry-Guided Progressive NeRF for Generalizable and Efficient Neural Human Rendering. (arXiv:2112.04312v2 [cs.CV] UPDATED)
    In this work we develop a generalizable and efficient Neural Radiance Field (NeRF) pipeline for high-fidelity free-viewpoint human body synthesis under settings with sparse camera views. Though existing NeRF-based methods can synthesize rather realistic details for human body, they tend to produce poor results when the input has self-occlusion, especially for unseen humans under sparse views. Moreover, these methods often require a large number of sampling points for rendering, which leads to low efficiency and limits their real-world applicability. To address these challenges, we propose a Geometry-guided Progressive NeRF~(GP-NeRF). In particular, to better tackle self-occlusion, we devise a geometry-guided multi-view feature integration approach that utilizes the estimated geometry prior to integrate the incomplete information from input views and construct a complete geometry volume for the target human body. Meanwhile, for achieving higher rendering efficiency, we introduce a geometry-guided progressive rendering pipeline, which leverages the geometric feature volume and the predicted density values to progressively reduce the number of sampling points and speed up the rendering process. Experiments on the ZJU-MoCap and THUman datasets show that our method outperforms the state-of-the-arts significantly across multiple generalization settings, while the time cost is reduced >70% via applying our efficient progressive rendering pipeline.  ( 2 min )
    SAGCI-System: Towards Sample-Efficient, Generalizable, Compositional, and Incremental Robot Learning. (arXiv:2111.14693v3 [cs.RO] UPDATED)
    Building general-purpose robots to perform a diverse range of tasks in a large variety of environments in the physical world at the human level is extremely challenging. It requires the robot learning to be sample-efficient, generalizable, compositional, and incremental. In this work, we introduce a systematic learning framework called SAGCI-system towards achieving these above four requirements. Our system first takes the raw point clouds gathered by the camera mounted on the robot's wrist as the inputs and produces initial modeling of the surrounding environment represented as a file of Unified Robot Description Format (URDF). Our system adopts a learning-augmented differentiable simulation that loads the URDF. The robot then utilizes the interactive perception to interact with the environment to online verify and modify the URDF. Leveraging the differentiable simulation, we propose a model-based learning algorithm combining object-centric and robot-centric stages to efficiently produce policies to accomplish manipulation tasks. We apply our system to perform articulated object manipulation tasks, both in the simulation and the real world. Extensive experiments demonstrate the effectiveness of our proposed learning framework. Supplemental materials and videos are available on https://sites.google.com/view/egci.  ( 2 min )
    Continual Learning of Multi-modal Dynamics with External Memory. (arXiv:2203.00936v1 [cs.LG])
    We study the problem of fitting a model to a dynamical environment when new modes of behavior emerge sequentially. The learning model is aware when a new mode appears, but it does not have access to the true modes of individual training sequences. We devise a novel continual learning method that maintains a descriptor of the mode of an encountered sequence in a neural episodic memory. We employ a Dirichlet Process prior on the attention weights of the memory to foster efficient storage of the mode descriptors. Our method performs continual learning by transferring knowledge across tasks by retrieving the descriptors of similar modes of past tasks to the mode of a current sequence and feeding this descriptor into its transition kernel as control input. We observe the continual learning performance of our method to compare favorably to the mainstream parameter transfer approach.  ( 2 min )
    HyperPrompt: Prompt-based Task-Conditioning of Transformers. (arXiv:2203.00759v1 [cs.CL])
    Prompt-Tuning is a new paradigm for finetuning pre-trained language models in a parameter-efficient way. Here, we explore the use of HyperNetworks to generate hyper-prompts: we propose HyperPrompt, a novel architecture for prompt-based task-conditioning of self-attention in Transformers. The hyper-prompts are end-to-end learnable via generation by a HyperNetwork. HyperPrompt allows the network to learn task-specific feature maps where the hyper-prompts serve as task global memories for the queries to attend to, at the same time enabling flexible information sharing among tasks. We show that HyperPrompt is competitive against strong multi-task learning baselines with as few as $0.14\%$ of additional task-conditioning parameters, achieving great parameter and computational efficiency. Through extensive empirical experiments, we demonstrate that HyperPrompt can achieve superior performances over strong T5 multi-task learning baselines and parameter-efficient adapter variants including Prompt-Tuning and HyperFormer++ on Natural Language Understanding benchmarks of GLUE and SuperGLUE across many model sizes.  ( 2 min )
    PUMA: Performance Unchanged Model Augmentation for Training Data Removal. (arXiv:2203.00846v1 [stat.ML])
    Preserving the performance of a trained model while removing unique characteristics of marked training data points is challenging. Recent research usually suggests retraining a model from scratch with remaining training data or refining the model by reverting the model optimization on the marked data points. Unfortunately, aside from their computational inefficiency, those approaches inevitably hurt the resulting model's generalization ability since they remove not only unique characteristics but also discard shared (and possibly contributive) information. To address the performance degradation problem, this paper presents a novel approach called Performance Unchanged Model Augmentation~(PUMA). The proposed PUMA framework explicitly models the influence of each training data point on the model's generalization ability with respect to various performance criteria. It then complements the negative impact of removing marked data by reweighting the remaining data optimally. To demonstrate the effectiveness of the PUMA framework, we compared it with multiple state-of-the-art data removal techniques in the experiments, where we show the PUMA can effectively and efficiently remove the unique characteristics of marked training data without retraining the model that can 1) fool a membership attack, and 2) resist performance degradation. In addition, as PUMA estimates the data importance during its operation, we show it could serve to debug mislabelled data points more efficiently than existing approaches.  ( 2 min )
    Human-robot collaboration and machine learning: a systematic review of recent research. (arXiv:2110.07448v2 [cs.RO] UPDATED)
    Technological progress increasingly envisions the use of robots interacting with people in everyday life. Human-robot collaboration (HRC) is the approach that explores the interaction between a human and a robot, during the completion of a common objective, at the cognitive and physical level. In HRC works, a cognitive model is typically built, which collects inputs from the environment and from the user, elaborates and translates these into information that can be used by the robot itself. Machine learning is a recent approach to build the cognitive model and behavioural block, with high potential in HRC. Consequently, this paper proposes a thorough literature review of the use of machine learning techniques in the context of human-robot collaboration. 45 key papers were selected and analysed, and a clustering of works based on the type of collaborative tasks, evaluation metrics and cognitive variables modelled is proposed. Then, a deep analysis on different families of machine learning algorithms and their properties, along with the sensing modalities used, is carried out. Among the observations, it is outlined the importance of the machine learning algorithms to incorporate time dependencies. The salient features of these works are then cross-analysed to show trends in HRC and give guidelines for future works, comparing them with other aspects of HRC not appeared in the review.  ( 2 min )
    Unrolling SGD: Understanding Factors Influencing Machine Unlearning. (arXiv:2109.13398v2 [cs.LG] UPDATED)
    Machine unlearning is the process through which a deployed machine learning model is made to forget about some of its training data points. While naively retraining the model from scratch is an option, it is almost always associated with large computational overheads for deep learning models. Thus, several approaches to approximately unlearn have been proposed along with corresponding metrics that formalize what it means for a model to forget about a data point. In this work, we first taxonomize approaches and metrics of approximate unlearning. As a result, we identify verification error, i.e., the L2 difference between the weights of an approximately unlearned and a naively retrained model, as an approximate unlearning metric that should be optimized for as it subsumes a large class of other metrics. We theoretically analyze the canonical training algorithm, stochastic gradient descent (SGD), to surface the variables which are relevant to reducing the verification error of approximate unlearning for SGD. From this analysis, we first derive an easy-to-compute proxy for verification error (termed unlearning error). The analysis also informs the design of a new training objective penalty that limits the overall change in weights during SGD and as a result facilitates approximate unlearning with lower verification error. We validate our theoretical work through an empirical evaluation on learning with CIFAR-10, CIFAR-100, and IMDB sentiment analysis.  ( 2 min )
    Learning Efficiently Function Approximation for Contextual MDP. (arXiv:2203.00995v1 [cs.LG])
    We study learning contextual MDPs using a function approximation for both the rewards and the dynamics. We consider both the case where the dynamics is known and unknown, and the case that the dynamics dependent or independent of the context. For all four models we derive polynomial sample and time complexity (assuming an efficient ERM oracle). Our methodology gives a general reduction from learning contextual MDP to supervised learning.  ( 2 min )
    Partial Likelihood Thompson Sampling. (arXiv:2203.00820v1 [stat.ME])
    We consider the problem of deciding how best to target and prioritize existing vaccines that may offer protection against new variants of an infectious disease. Sequential experiments are a promising approach; however, challenges due to delayed feedback and the overall ebb and flow of disease prevalence make available method inapplicable for this task. We present a method, partial likelihood Thompson sampling, that can handle these challenges. Our method involves running Thompson sampling with belief updates determined by partial likelihood each time we observe an event. To test our approach, we ran a semi-synthetic experiment based on 200 days of COVID-19 infection data in the US.  ( 2 min )
    Leveraging power grid topology in machine learning assisted optimal power flow. (arXiv:2110.00306v2 [cs.LG] UPDATED)
    Machine learning assisted optimal power flow (OPF) aims to reduce the computational complexity of these non-linear and non-convex constrained optimization problems by consigning expensive (online) optimization to offline training. The majority of work in this area typically employs fully connected neural networks (FCNN). However, recently convolutional (CNN) and graph (GNN) neural networks have also been investigated, in effort to exploit topological information within the power grid. Although promising results have been obtained, there lacks a systematic comparison between these architectures throughout literature. Accordingly, we introduce a concise framework for generalizing methods for machine learning assisted OPF and assess the performance of a variety of FCNN, CNN and GNN models for two fundamental approaches in this domain: regression (predicting optimal generator set-points) and classification (predicting the active set of constraints). For several synthetic power grids with interconnected utilities, we show that locality properties between feature and target variables are scarce and subsequently demonstrate marginal utility of applying CNN and GNN architectures compared to FCNN for a fixed grid topology. However, with variable topology (for instance, modeling transmission line contingency), GNN models are able to straightforwardly take the change of topological information into account and outperform both FCNN and CNN models.  ( 2 min )
    Selection, Ignorability and Challenges With Causal Fairness. (arXiv:2202.13774v2 [stat.ML] UPDATED)
    In this paper we look at popular fairness methods that use causal counterfactuals. These methods capture the intuitive notion that a prediction is fair if it coincides with the prediction that would have been made if someone's race, gender or religion were counterfactually different. In order to achieve this, we must have causal models that are able to capture what someone would be like if we were to counterfactually change these traits. However, we argue that any model that can do this must lie outside the particularly well behaved class that is commonly considered in the fairness literature. This is because in fairness settings, models in this class entail a particularly strong causal assumption, normally only seen in a randomised controlled trial. We argue that in general this is unlikely to hold. Furthermore, we show in many cases it can be explicitly rejected due to the fact that samples are selected from a wider population. We show this creates difficulties for counterfactual fairness as well as for the application of more general causal fairness methods.  ( 2 min )
    On Label Shift in Domain Adaptation via Wasserstein Distance. (arXiv:2110.15520v2 [cs.LG] UPDATED)
    We study the label shift problem between the source and target domains in general domain adaptation (DA) settings. We consider transformations transporting the target to source domains, which enable us to align the source and target examples. Through those transformations, we define the label shift between two domains via optimal transport and develop theory to investigate the properties of DA under various DA settings (e.g., closed-set, partial-set, open-set, and universal settings). Inspired from the developed theory, we propose Label and Data Shift Reduction via Optimal Transport (LDROT) which can mitigate the data and label shifts simultaneously. Finally, we conduct comprehensive experiments to verify our theoretical findings and compare LDROT with state-of-the-art baselines.  ( 2 min )
    Metalearning Linear Bandits by Prior Update. (arXiv:2107.05320v2 [stat.ML] UPDATED)
    Fully Bayesian approaches to sequential decision-making assume that problem parameters are generated from a known prior. In practice, such information is often lacking. This problem is exacerbated in setups with partial information, where a misspecified prior may lead to poor exploration and performance. In this work we prove, in the context of stochastic linear bandits and Gaussian priors, that as long as the prior is sufficiently close to the true prior, the performance of the applied algorithm is close to that of the algorithm that uses the true prior. Furthermore, we address the task of learning the prior through metalearning, where a learner updates her estimate of the prior across multiple task instances in order to improve performance on future tasks. We provide an algorithm and regret bounds, demonstrate its effectiveness in comparison to an algorithm that knows the correct prior, and support our theoretical results empirically. Our theoretical results hold for a broad class of algorithms, including Thompson Sampling and Information Directed Sampling.  ( 2 min )
    Stable, accurate and efficient deep neural networks for inverse problems with analysis-sparse models. (arXiv:2203.00804v1 [cs.LG])
    Solving inverse problems is a fundamental component of science, engineering and mathematics. With the advent of deep learning, deep neural networks have significant potential to outperform existing state-of-the-art, model-based methods for solving inverse problems. However, it is known that current data-driven approaches face several key issues, notably instabilities and hallucinations, with potential impact in critical tasks such as medical imaging. This raises the key question of whether or not one can construct stable and accurate deep neural networks for inverse problems. In this work, we present a novel construction of an accurate, stable and efficient neural network for inverse problems with general analysis-sparse models. To construct the network, we unroll NESTA, an accelerated first-order method for convex optimization. Combined with a compressed sensing analysis, we prove accuracy and stability. Finally, a restart scheme is employed to enable exponential decay of the required network depth, yielding a shallower, and consequently more efficient, network. We showcase this approach in the case of Fourier imaging, and verify its stability and performance via a series of numerical experiments. The key impact of this work is to provide theoretical guarantees for computing and developing stable neural networks in practice.  ( 2 min )
    Combining Reinforcement Learning and Optimal Transport for the Traveling Salesman Problem. (arXiv:2203.00903v1 [cs.LG])
    The traveling salesman problem is a fundamental combinatorial optimization problem with strong exact algorithms. However, as problems scale up, these exact algorithms fail to provide a solution in a reasonable time. To resolve this, current works look at utilizing deep learning to construct reasonable solutions. Such efforts have been very successful, but tend to be slow and compute intensive. This paper exemplifies the integration of entropic regularized optimal transport techniques as a layer in a deep reinforcement learning network. We show that we can construct a model capable of learning without supervision and inferences significantly faster than current autoregressive approaches. We also empirically evaluate the benefits of including optimal transport algorithms within deep learning models to enforce assignment constraints during end-to-end training.  ( 2 min )
    TANDEM: Learning Joint Exploration and Decision Making with Tactile Sensors. (arXiv:2203.00798v1 [cs.RO])
    Inspired by the human ability to perform complex manipulation in the complete absence of vision (like retrieving an object from a pocket), the robotic manipulation field is motivated to develop new methods for tactile-based object interaction. However, tactile sensing presents the challenge of being an active sensing modality: a touch sensor provides sparse, local data, and must be used in conjunction with effective exploration strategies in order to collect information. In this work, we focus on the process of guiding tactile exploration, and its interplay with task-related decision making. We propose TANDEM (TActile exploration aNd DEcision Making), an architecture to learn efficient exploration strategies in conjunction with decision making. Our approach is based on separate but co-trained modules for exploration and discrimination. We demonstrate this method on a tactile object recognition task, where a robot equipped with a touch sensor must explore and identify an object from a known set based on tactile feedback alone. TANDEM achieves higher accuracy with fewer actions than alternative methods and is also shown to be more robust to sensor noise.  ( 2 min )
    Mitigating Statistical Bias within Differentially Private Synthetic Data. (arXiv:2108.10934v2 [stat.ML] UPDATED)
    Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacy-compliant augmentation for general applications of synthetic data.  ( 2 min )
    Code Smells in Machine Learning Systems. (arXiv:2203.00803v1 [cs.SE])
    As Deep learning (DL) systems continuously evolve and grow, assuring their quality becomes an important yet challenging task. Compared to non-DL systems, DL systems have more complex team compositions and heavier data dependency. These inherent characteristics would potentially cause DL systems to be more vulnerable to bugs and, in the long run, to maintenance issues. Code smells are empirically tested as efficient indicators of non-DL systems. Therefore, we took a step forward into identifying code smells, and understanding their impact on maintenance in this comprehensive study. This is the first study on investigating code smells in the context of DL software systems, which helps researchers and practitioners to get a first look at what kind of maintenance modification made and what code smells developers have been dealing with. Our paper has three major contributions. First, we comprehensively investigated the maintenance modifications that have been made by DL developers via studying the evolution of DL systems, and we identified nine frequently occurred maintenance-related modification categories in DL systems. Second, we summarized five code smells in DL systems. Third, we validated the prevalence, and the impact of our newly identified code smells through a mixture of qualitative and quantitative analysis. We found that our newly identified code smells are prevalent and impactful on the maintenance of DL systems from the developer's perspective.  ( 2 min )
    Enhanced Nearest Neighbor Classification for Crowdsourcing. (arXiv:2203.00781v1 [cs.HC])
    In machine learning, crowdsourcing is an economical way to label a large amount of data. However, the noise in the produced labels may deteriorate the accuracy of any classification method applied to the labelled data. We propose an enhanced nearest neighbor classifier (ENN) to overcome this issue. Two algorithms are developed to estimate the worker quality (which is often unknown in practice): one is to construct the estimate based on the denoised worker labels by applying the $k$NN classifier to the expert data; the other is an iterative algorithm that works even without access to the expert data. Other than strong numerical evidence, our proposed methods are proven to achieve the same regret as its oracle version based on high-quality expert data. As a technical by-product, a lower bound on the sample size assigned to each worker to reach the optimal convergence rate of regret is derived.  ( 2 min )
    Maximum Mean Discrepancy for Generalization in the Presence of Distribution and Missingness Shift. (arXiv:2111.10344v2 [cs.LG] UPDATED)
    Covariate shifts are a common problem in predictive modeling on real-world problems. This paper proposes addressing the covariate shift problem by minimizing Maximum Mean Discrepancy (MMD) statistics between the training and test sets in either feature input space, feature representation space, or both. We designed three techniques that we call MMD Representation, MMD Mask, and MMD Hybrid to deal with the scenarios where only a distribution shift exists, only a missingness shift exists, or both types of shift exist, respectively. We find that integrating an MMD loss component helps models use the best features for generalization and avoid dangerous extrapolation as much as possible for each test sample. Models treated with this MMD approach show better performance, calibration, and extrapolation on the test set.  ( 2 min )
    FinRL-Meta: A Universe of Near-Real Market Environments for Data-Driven Deep Reinforcement Learning in Quantitative Finance. (arXiv:2112.06753v2 [q-fin.TR] UPDATED)
    Deep reinforcement learning (DRL) has shown huge potentials in building financial market simulators recently. However, due to the highly complex and dynamic nature of real-world markets, raw historical financial data often involve large noise and may not reflect the future of markets, degrading the fidelity of DRL-based market simulators. Moreover, the accuracy of DRL-based market simulators heavily relies on numerous and diverse DRL agents, which increases demand for a universe of market environments and imposes a challenge on simulation speed. In this paper, we present a FinRL-Meta framework that builds a universe of market environments for data-driven financial reinforcement learning. First, FinRL-Meta separates financial data processing from the design pipeline of DRL-based strategy and provides open-source data engineering tools for financial big data. Second, FinRL-Meta provides hundreds of market environments for various trading tasks. Third, FinRL-Meta enables multiprocessing simulation and training by exploiting thousands of GPU cores. Our codes are available online at https://github.com/AI4Finance-Foundation/FinRL-Meta.  ( 2 min )
    GAP: Differentially Private Graph Neural Networks with Aggregation Perturbation. (arXiv:2203.00949v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful models designed for graph data that learn node representation by recursively aggregating information from each node's local neighborhood. However, despite their state-of-the-art performance in predictive graph-based applications, recent studies have shown that GNNs can raise significant privacy concerns when graph data contain sensitive information. As a result, in this paper, we study the problem of learning GNNs with Differential Privacy (DP). We propose GAP, a novel differentially private GNN that safeguards the privacy of nodes and edges using aggregation perturbation, i.e., adding calibrated stochastic noise to the output of the GNN's aggregation function, which statistically obfuscates the presence of a single edge (edge-level privacy) or a single node and all its adjacent edges (node-level privacy). To circumvent the accumulation of privacy cost at every forward pass of the model, we tailor the GNN architecture to the specifics of private learning. In particular, we first precompute private aggregations by recursively applying neighborhood aggregation and perturbing the output of each aggregation step. Then, we privately train a deep neural network on the resulting perturbed aggregations for any node-wise classification task. A major advantage of GAP over previous approaches is that we guarantee edge-level and node-level DP not only for training, but also at inference time with no additional costs beyond the training's privacy budget. We theoretically analyze the formal privacy guarantees of GAP using R\'enyi DP. Empirical experiments conducted over three real-world graph datasets demonstrate that GAP achieves a favorable privacy-accuracy trade-off and significantly outperforms existing approaches.  ( 2 min )
    Keeping Minimal Experience to Achieve Efficient Interpretable Policy Distillation. (arXiv:2203.00822v1 [cs.LG])
    Although deep reinforcement learning has become a universal solution for complex control tasks, its real-world applicability is still limited because lacking security guarantees for policies. To address this problem, we propose Boundary Characterization via the Minimum Experience Retention (BCMER), an end-to-end Interpretable Policy Distillation (IPD) framework. Unlike previous IPD approaches, BCMER distinguishes the importance of experiences and keeps a minimal but critical experience pool with almost no loss of policy similarity. Specifically, the proposed BCMER contains two basic steps. Firstly, we propose a novel multidimensional hyperspheres intersection (MHI) approach to divide experience points into boundary points and internal points, and reserve the crucial boundary points. Secondly, we develop a nearest-neighbor-based model to generate robust and interpretable decision rules based on the boundary points. Extensive experiments show that the proposed BCMER is able to reduce the amount of experience to 1.4%~19.1% (when the count of the naive experiences is 10k) and maintain high IPD performance. In general, the proposed BCMER is more suitable for the experience storage limited regime because it discovers the critical experience and eliminates redundant experience.  ( 2 min )
    Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition. (arXiv:2110.03327v2 [eess.AS] UPDATED)
    As end-to-end automatic speech recognition (ASR) models reach promising performance, various downstream tasks rely on good confidence estimators for these systems. Recent research has shown that model-based confidence estimators have a significant advantage over using the output softmax probabilities. If the input data to the speech recogniser is from mismatched acoustic and linguistic conditions, the ASR performance and the corresponding confidence estimators may exhibit severe degradation. Since confidence models are often trained on the same in-domain data as the ASR, generalising to out-of-domain (OOD) scenarios is challenging. By keeping the ASR model untouched, this paper proposes two approaches to improve the model-based confidence estimators on OOD data: using pseudo transcriptions and an additional OOD language model. With an ASR model trained on LibriSpeech, experiments show that the proposed methods can greatly improve the confidence metrics on TED-LIUM and Switchboard datasets while preserving in-domain performance. Furthermore, the improved confidence estimators are better calibrated on OOD data and can provide a much more reliable criterion for data selection.  ( 2 min )
    How to Inject Backdoors with Better Consistency: Logit Anchoring on Clean Data. (arXiv:2109.01300v2 [cs.LG] UPDATED)
    Since training a large-scale backdoored model from scratch requires a large training dataset, several recent attacks have considered to inject backdoors into a trained clean model without altering model behaviors on the clean data. Previous work finds that backdoors can be injected into a trained clean model with Adversarial Weight Perturbation (AWP). Here AWPs refers to the variations of parameters that are small in backdoor learning. In this work, we observe an interesting phenomenon that the variations of parameters are always AWPs when tuning the trained clean model to inject backdoors. We further provide theoretical analysis to explain this phenomenon. We formulate the behavior of maintaining accuracy on clean data as the consistency of backdoored models, which includes both global consistency and instance-wise consistency. We extensively analyze the effects of AWPs on the consistency of backdoored models. In order to achieve better consistency, we propose a novel anchoring loss to anchor or freeze the model behaviors on the clean data, with a theoretical guarantee. Both the analytical and the empirical results validate the effectiveness of the anchoring loss in improving the consistency, especially the instance-wise consistency.  ( 2 min )
    Optimality and complexity of classification by random projection. (arXiv:2108.06339v2 [cs.LG] UPDATED)
    The generalization error of a classifier is related to the complexity of the set of functions among which the classifier is chosen. We study a family of low-complexity classifiers consisting of thresholding a random one-dimensional feature. The feature is obtained by projecting the data on a random line after embedding it into a higher-dimensional space parametrized by monomials of order up to k. More specifically, the extended data is projected n-times and the best classifier among those n, based on its performance on training data, is chosen. We show that this type of classifier is extremely flexible, as it is likely to approximate, to an arbitrary precision, any continuous function on a compact set as well as any boolean function on a compact set that splits the support into measurable subsets. In particular, given full knowledge of the class conditional densities, the error of these low-complexity classifiers would converge to the optimal (Bayes) error as k and n go to infinity. On the other hand, if only a training dataset is given, we show that the classifiers will perfectly classify all the training points as k and n go to infinity. We also bound the generalization error of our random classifiers. In general, our bounds are better than those for any classifier with VC dimension greater than O (ln n) . In particular, our bounds imply that, unless the number of projections n is extremely large, there is a significant advantageous gap between the generalization error of the random projection approach and that of a linear classifier in the extended space. Asymptotically, as the number of samples approaches infinity, the gap persists for any such n. Thus, there is a potentially large gain in generalization properties by selecting parameters at random, rather than optimization.  ( 3 min )
    Expert Q-learning: Deep Reinforcement Learning with Coarse State Values from Offline Expert Examples. (arXiv:2106.14642v3 [cs.LG] UPDATED)
    In this article, we propose a novel algorithm for deep reinforcement learning named Expert Q-learning. Expert Q-learning is inspired by Dueling Q-learning and aims at incorporating semi-supervised learning into reinforcement learning through splitting Q-values into state values and action advantages. We require that an offline expert assesses the value of a state in a coarse manner using three discrete values. An expert network is designed in addition to the Q-network, which updates each time following the regular offline minibatch update whenever the expert example buffer is not empty. Using the board game Othello, we compare our algorithm with the baseline Q-learning algorithm, which is a combination of Double Q-learning and Dueling Q-learning. Our results show that Expert Q-learning is indeed useful and more resistant to the overestimation bias. The baseline Q-learning algorithm exhibits unstable and suboptimal behavior in non-deterministic settings, whereas Expert Q-learning demonstrates more robust performance with higher scores, illustrating that our algorithm is indeed suitable to integrate state values from expert examples into Q-learning.  ( 2 min )
    Constraining Linear-chain CRFs to Regular Languages. (arXiv:2106.07306v5 [cs.LG] UPDATED)
    A major challenge in structured prediction is to represent the interdependencies within output structures. When outputs are structured as sequences, linear-chain conditional random fields (CRFs) are a widely used model class which can learn \textit{local} dependencies in the output. However, the CRF's Markov assumption makes it impossible for CRFs to represent distributions with \textit{nonlocal} dependencies, and standard CRFs are unable to respect nonlocal constraints of the data (such as global arity constraints on output labels). We present a generalization of CRFs that can enforce a broad class of constraints, including nonlocal ones, by specifying the space of possible output structures as a regular language $\mathcal{L}$. The resulting regular-constrained CRF (RegCCRF) has the same formal properties as a standard CRF, but assigns zero probability to all label sequences not in $\mathcal{L}$. Notably, RegCCRFs can incorporate their constraints during training, while related models only enforce constraints during decoding. We prove that constrained training is never worse than constrained decoding, and show empirically that it can be substantially better in practice. Additionally, we demonstrate a practical benefit on downstream tasks by incorporating a RegCCRF into a deep neural model for semantic role labeling, exceeding state-of-the-art results on a standard dataset.  ( 2 min )
    Accelerating Neural ODEs Using Model Order Reduction. (arXiv:2105.14070v2 [cs.LG] UPDATED)
    Embedding nonlinear dynamical systems into artificial neural networks is a powerful new formalism for machine learning. By parameterizing ordinary differential equations (ODEs) as neural network layers, these Neural ODEs are memory-efficient to train, process time-series naturally and incorporate knowledge of physical systems into deep learning models. However, the practical applications of Neural ODEs are limited due to long inference times, because the outputs of the embedded ODE layers are computed numerically with differential equation solvers that can be computationally demanding. Here we show that mathematical model order reduction methods can be used for compressing and accelerating Neural ODEs by accurately simulating the continuous nonlinear dynamics in low-dimensional subspaces. We implement our novel compression method by developing Neural ODEs that integrate the necessary subspace-projection and interpolation operations as layers of the neural network. We validate our approach by comparing it to neuron pruning and SVD-based weight truncation methods from the literature in image and time-series classification tasks. The methods are evaluated by acceleration versus accuracy when adjusting the level of compression. On this spectrum, we achieve a favourable balance over existing methods by using model order reduction when compressing a convolutional Neural ODE. In compressing a recurrent Neural ODE, SVD-based weight truncation yields good performance. Based on our results, our integration of model order reduction with Neural ODEs can facilitate efficient, dynamical system-driven deep learning in resource-constrained applications.  ( 2 min )
    Sparse PointPillars: Maintaining and Exploiting Input Sparsity to Improve Runtime on Embedded Systems. (arXiv:2106.06882v3 [cs.CV] UPDATED)
    Bird's Eye View (BEV) is a popular representation for processing 3D point clouds, and by its nature is fundamentally sparse. Motivated by the computational limitations of mobile robot platforms, we create a fast, high-performance BEV 3D object detector that maintains and exploits this input sparsity to decrease runtimes over non-sparse baselines and avoids the tradeoff between pseudoimage area and runtime. We present results on KITTI, a canonical 3D detection dataset, and Matterport-Chair, a novel Matterport3D-derived chair detection dataset from scenes in real furnished homes. We evaluate runtime characteristics using a desktop GPU, an embedded ML accelerator, and a robot CPU, demonstrating that our method results in significant detection speedups (2X or more) for embedded systems with only a modest decrease in detection quality. Our work represents a new approach for practitioners to optimize models for embedded systems by maintaining and exploiting input sparsity throughout their entire pipeline to reduce runtime and resource usage while preserving detection performance.  ( 2 min )
    Principled Deep Neural Network Training through Linear Programming. (arXiv:1810.03218v3 [cs.LG] UPDATED)
    Deep learning has received much attention lately due to the impressive empirical performance achieved by training algorithms. Consequently, a need for a better theoretical understanding of these problems has become more evident in recent years. In this work, using a unified framework, we show that there exists a polyhedron which encodes simultaneously all possible deep neural network training problems that can arise from a given architecture, activation functions, loss function, and sample-size. Notably, the size of the polyhedral representation depends only linearly on the sample-size, and a better dependency on several other network parameters is unlikely (assuming $P\neq NP$). Additionally, we use our polyhedral representation to obtain new and better computational complexity results for training problems of well-known neural network architectures. Our results provide a new perspective on training problems through the lens of polyhedral theory and reveal a strong structure arising from these problems.  ( 2 min )
    The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems. (arXiv:2006.07856v4 [cs.LG] UPDATED)
    This paper presents and characterizes an Open Application Repository for Federated Learning (OARF), a benchmark suite for federated machine learning systems. Previously available benchmarks for federated learning have focused mainly on synthetic datasets and use a limited number of applications. OARF mimics more realistic application scenarios with publicly available data sets as different data silos in image, text and structured data. Our characterization shows that the benchmark suite is diverse in data size, distribution, feature distribution and learning task complexity. The extensive evaluations with reference implementations show the future research opportunities for important aspects of federated learning systems. We have developed reference implementations, and evaluated the important aspects of federated learning, including model accuracy, communication cost, throughput and convergence time. Through these evaluations, we discovered some interesting findings such as federated learning can effectively increase end-to-end throughput.  ( 2 min )
    LIMEADE: From AI Explanations to Advice Taking. (arXiv:2003.04315v3 [cs.IR] UPDATED)
    Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow an AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA$^2$Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on seventy real-world models across two broad domains: image classification and text recommendation. We show our method improves accuracy compared to a rigorous baseline on the image classification domains. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.  ( 2 min )
    Online Competitive Influence Maximization. (arXiv:2006.13411v4 [cs.LG] UPDATED)
    Online influence maximization has attracted much attention as a way to maximize influence spread through a social network while learning the values of unknown network parameters. Most previous works focus on single-item diffusion. In this paper, we introduce a new Online Competitive Influence Maximization (OCIM) problem, where two competing items (e.g., products, news stories) propagate in the same network and influence probabilities on edges are unknown. We adopt a combinatorial multi-armed bandit (CMAB) framework for OCIM, but unlike the non-competitive setting, the important monotonicity property (influence spread increases when influence probabilities on edges increase) no longer holds due to the competitive nature of propagation, which brings a significant new challenge to the problem. We provide a nontrivial proof showing that the Triggering Probability Modulated (TPM) condition for CMAB still holds in OCIM, which is instrumental for our proposed algorithms OCIM-TS and OCIM-OFU to achieve sublinear Bayesian and frequentist regret, respectively. We also design an OCIM-ETC algorithm that requires less feedback and easier offline computation, at the expense of a worse frequentist regret bound. Experimental evaluations demonstrate the effectiveness of our algorithms.  ( 2 min )
    An Analysis of Ensemble Sampling. (arXiv:2203.01303v1 [cs.LG])
    Ensemble sampling serves as a practical approximation to Thompson sampling when maintaining an exact posterior distribution over model parameters is computationally intractable. In this paper, we establish a Bayesian regret bound that ensures desirable behavior when ensemble sampling is applied to the linear bandit problem. This represents the first rigorous regret analysis of ensemble sampling and is made possible by leveraging information-theoretic concepts and novel analytic techniques that may prove useful beyond the scope of this paper.  ( 2 min )
    Modeling Extremes with d-max-decreasing Neural Networks. (arXiv:2102.09042v2 [stat.ML] UPDATED)
    We propose a novel neural network architecture that enables non-parametric calibration and generation of multivariate extreme value distributions (MEVs). MEVs arise from Extreme Value Theory (EVT) as the necessary class of models when extrapolating a distributional fit over large spatial and temporal scales based on data observed in intermediate scales. In turn, EVT dictates that $d$-max-decreasing, a stronger form of convexity, is an essential shape constraint in the characterization of MEVs. As far as we know, our proposed architecture provides the first class of non-parametric estimators for MEVs that preserve these essential shape constraints. We show that our architecture approximates the dependence structure encoded by MEVs at parametric rate. Moreover, we present a new method for sampling high-dimensional MEVs using a generative model. We demonstrate our methodology on a wide range of experimental settings, ranging from environmental sciences to financial mathematics and verify that the structural properties of MEVs are retained compared to existing methods.  ( 2 min )
    Beyond the Hype: A Real-World Evaluation of the Impact and Cost of Machine Learning-Based Malware Detection. (arXiv:2012.09214v3 [cs.CR] UPDATED)
    In this paper, we present a scientific evaluation of four market-leading malware detection tools to assist an organization with two primary questions: To what extent do ML-based tools accurately classify previously- and never-before-seen files? Is it worth purchasing a network-level malware detector? To identify weaknesses, we tested each tool against 3,536 total files (2,554 or 72\% malicious, 982 or 28\% benign) of a variety of file types, including hundreds of malicious zero-days, polyglots, and APT-style files, delivered on multiple protocols. We present statistical results on detection time and accuracy, consider complementary analysis (using multiple tools together), and provide two novel applications of the recent cost-benefit evaluation procedure of Iannacone \& Bridges. While the ML-based tools are more effective at detecting zero-day files and executables, the signature-based tool may still be an overall better option. Both network-based tools provide substantial (simulated) savings when paired with either host tool, yet both show poor detection rates on protocols other than HTTP or SMTP. Our results show that all four tools have near-perfect precision but alarmingly low recall, especially on file types other than executables and office files -- 37% of malware tested, including all polyglot files, were undetected. Priorities for researchers and takeaways for end users are given.  ( 2 min )
    A new perspective on low-rank optimization. (arXiv:2105.05947v2 [math.OC] UPDATED)
    A key question in many low-rank problems throughout optimization, machine learning, and statistics is to characterize the convex hulls of simple low-rank sets and judiciously apply these convex hulls to obtain strong yet computationally tractable convex relaxations. We invoke the matrix perspective function - the matrix analog of the perspective function - and characterize explicitly the convex hull of epigraphs of simple matrix convex functions under low-rank constraints. Further, we combine the matrix perspective function with orthogonal projection matrices-the matrix analog of binary variables which capture the row-space of a matrix-to develop a matrix perspective reformulation technique that reliably obtains strong relaxations for a variety of low-rank problems, including reduced rank regression, non-negative matrix factorization, and factor analysis. Moreover, we establish that these relaxations can be modeled via semidefinite constraints and thus optimized over tractably. The proposed approach parallels and generalizes the perspective reformulation technique in mixed-integer optimization and leads to new relaxations for a broad class of problems.  ( 2 min )
    Supervised Hebbian learning: toward eXplainable AI. (arXiv:2203.01304v1 [cond-mat.dis-nn])
    In neural network's Literature, {\em Hebbian learning} traditionally refers to the procedure by which the Hopfield model and its generalizations {\em store} archetypes (i.e., definite patterns that are experienced just once to form the synaptic matrix). However, the term {\em learning} in Machine Learning refers to the ability of the machine to extract features from the supplied dataset (e.g., made of blurred examples of these archetypes), in order to make its own representation of the unavailable archetypes. Here we prove that, if we feed the Hopfield model with blurred examples, we can define both {\em supervised} and {\em unsupervised} learning protocols by which the network can possibly infer the archetypes and we detect the correct control parameters (including the dataset size and its quality) to depict a phase diagram for the system performance. We also prove that, for random, structureless datasets, the Hopfield model equipped with a supervised learning rule is equivalent to a restricted Boltzmann machine and this suggests an optimal training routine; the robustness of results is also checked numerically for structured datasets. This work contributes to pave a solid way toward eXplainable AI (XAI).  ( 2 min )
    HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning. (arXiv:2203.01311v1 [cs.LG])
    Learning multimodal representations involves discovering correspondences and integrating information from multiple heterogeneous sources of data. While recent research has begun to explore the design of more general-purpose multimodal models (contrary to prior focus on domain and modality-specific architectures), these methods are still largely focused on a small set of modalities in the language, vision, and audio space. In order to accelerate generalization towards diverse and understudied modalities, we investigate methods for high-modality (a large set of diverse modalities) and partially-observable (each task only defined on a small subset of modalities) scenarios. To tackle these challenges, we design a general multimodal model that enables multitask and transfer learning: multitask learning with shared parameters enables stable parameter counts (addressing scalability), and cross-modal transfer learning enables information sharing across modalities and tasks (addressing partial observability). Our resulting model generalizes across text, image, video, audio, time-series, sensors, tables, and set modalities from different research areas, improves the tradeoff between performance and efficiency, transfers to new modalities and tasks, and reveals surprising insights on the nature of information sharing in multitask models. We release our code and benchmarks which we hope will present a unified platform for subsequent theoretical and empirical analysis: https://github.com/pliang279/HighMMT.  ( 2 min )
    Continual Learning via Bit-Level Information Preserving. (arXiv:2105.04444v3 [cs.LG] UPDATED)
    Continual learning tackles the setting of learning different tasks sequentially. Despite the lots of previous solutions, most of them still suffer significant forgetting or expensive memory cost. In this work, targeted at these problems, we first study the continual learning process through the lens of information theory and observe that forgetting of a model stems from the loss of \emph{information gain} on its parameters from the previous tasks when learning a new task. From this viewpoint, we then propose a novel continual learning approach called Bit-Level Information Preserving (BLIP) that preserves the information gain on model parameters through updating the parameters at the bit level, which can be conveniently implemented with parameter quantization. More specifically, BLIP first trains a neural network with weight quantization on the new incoming task and then estimates information gain on each parameter provided by the task data to determine the bits to be frozen to prevent forgetting. We conduct extensive experiments ranging from classification tasks to reinforcement learning tasks, and the results show that our method produces better or on par results comparing to previous state-of-the-arts. Indeed, BLIP achieves close to zero forgetting while only requiring constant memory overheads throughout continual learning.  ( 2 min )
    STEADY: Simultaneous State Estimation and Dynamics Learning from Indirect Observations. (arXiv:2203.01299v1 [cs.RO])
    Accurate kinodynamic models play a crucial role in many robotics applications such as off-road navigation and high-speed driving. Many state-of-the-art approaches in learning stochastic kinodynamic models, however, require precise measurements of robot states as labeled input/output examples, which can be hard to obtain in outdoor settings due to limited sensor capabilities and the absence of ground truth. In this work, we propose a new technique for learning neural stochastic kinodynamic models from noisy and indirect observations by performing simultaneous state estimation and dynamics learning. The proposed technique iteratively improves the kinodynamic model in an expectation-maximization loop, where the E Step samples posterior state trajectories using particle filtering, and the M Step updates the dynamics to be more consistent with the sampled trajectories via stochastic gradient ascent. We evaluate our approach on both simulation and real-world benchmarks and compare it with several baseline techniques. Our approach not only achieves significantly higher accuracy but is also more robust to observation noise, thereby showing promise for boosting the performance of many other robotics applications.  ( 2 min )
    Adaptive Granularity in Tensors: A Quest for Interpretable Structure. (arXiv:1912.09009v2 [cs.LG] UPDATED)
    Data collected at very frequent intervals is usually extremely sparse and has no structure that is exploitable by modern tensor decomposition algorithms. Thus the utility of such tensors is low, in terms of the amount of interpretable and exploitable structure that one can extract from them. In this paper, we introduce the problem of finding a tensor of adaptive aggregated granularity that can be decomposed to reveal meaningful latent concepts (structures) from datasets that, in their original form, are not amenable to tensor analysis. Such datasets fall under the broad category of sparse point processes that evolve over space and/or time. To the best of our knowledge, this is the first work that explores adaptive granularity aggregation in tensors. Furthermore, we formally define the problem and discuss what different definitions of "good structure" can be in practice, and show that optimal solution is of prohibitive combinatorial complexity. Subsequently, we propose an efficient and effective greedy algorithm called IceBreaker, which follows a number of intuitive decision criteria that locally maximize the "goodness of structure", resulting in high-quality tensors. We evaluate our method on synthetic, semi-synthetic and real datasets. In all the cases, our proposed method constructs tensors that have very high structure quality.  ( 2 min )
    Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls. (arXiv:2104.09311v2 [math.OC] UPDATED)
    We study finite-time horizon continuous-time linear-convex reinforcement learning problems in an episodic setting. In this problem, the unknown linear jump-diffusion process is controlled subject to nonsmooth convex costs. We show that the associated linear-convex control problems admit Lipchitz continuous optimal feedback controls and further prove the Lipschitz stability of the feedback controls, i.e., the performance gap between applying feedback controls for an incorrect model and for the true model depends Lipschitz-continuously on the magnitude of perturbations in the model coefficients; the proof relies on a stability analysis of the associated forward-backward stochastic differential equation. We then propose a novel least-squares algorithm which achieves a regret of the order $O(\sqrt{N\ln N})$ on linear-convex learning problems with jumps, where $N$ is the number of learning episodes; the analysis leverages the Lipschitz stability of feedback controls and concentration properties of sub-Weibull random variables. Numerical experiment confirms the convergence and the robustness of the proposed algorithm.  ( 2 min )
    FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance. (arXiv:2011.09607v2 [q-fin.TR] UPDATED)
    As deep reinforcement learning (DRL) has been recognized as an effective approach in quantitative finance, getting hands-on experiences is attractive to beginners. However, to train a practical DRL trading agent that decides where to trade, at what price, and what quantity involves error-prone and arduous development and debugging. In this paper, we introduce a DRL library FinRL that facilitates beginners to expose themselves to quantitative finance and to develop their own stock trading strategies. Along with easily-reproducible tutorials, FinRL library allows users to streamline their own developments and to compare with existing schemes easily. Within FinRL, virtual environments are configured with stock market datasets, trading agents are trained with neural networks, and extensive backtesting is analyzed via trading performance. Moreover, it incorporates important trading constraints such as transaction cost, market liquidity and the investor's degree of risk-aversion. FinRL is featured with completeness, hands-on tutorial and reproducibility that favors beginners: (i) at multiple levels of time granularity, FinRL simulates trading environments across various stock markets, including NASDAQ-100, DJIA, S&P 500, HSI, SSE 50, and CSI 300; (ii) organized in a layered architecture with modular structure, FinRL provides fine-tuned state-of-the-art DRL algorithms (DQN, DDPG, PPO, SAC, A2C, TD3, etc.), commonly-used reward functions and standard evaluation baselines to alleviate the debugging workloads and promote the reproducibility, and (iii) being highly extendable, FinRL reserves a complete set of user-import interfaces. Furthermore, we incorporated three application demonstrations, namely single stock trading, multiple stock trading, and portfolio allocation. The FinRL library will be available on Github at link https://github.com/AI4Finance-LLC/FinRL-Library.  ( 3 min )
    Debiased-CAM to mitigate image perturbations with faithful visual explanations of machine learning. (arXiv:2012.05567v2 [cs.CV] UPDATED)
    Model explanations such as saliency maps can improve user trust in AI by highlighting important features for a prediction. However, these become distorted and misleading when explaining predictions of images that are subject to systematic error (bias) by perturbations and corruptions. Furthermore, the distortions persist despite model fine-tuning on images biased by different factors (blur, color temperature, day/night). We present Debiased-CAM to recover explanation faithfulness across various bias types and levels by training a multi-input, multi-task model with auxiliary tasks for explanation and bias level predictions. In simulation studies, the approach not only enhanced prediction accuracy, but also generated highly faithful explanations about these predictions as if the images were unbiased. In user studies, debiased explanations improved user task performance, perceived truthfulness and perceived helpfulness. Debiased training can provide a versatile platform for robust performance and explanation faithfulness for a wide range of applications with data biases.  ( 2 min )
    Incentivizing Exploration with Selective Data Disclosure. (arXiv:1811.06026v5 [cs.GT] UPDATED)
    We study the design of rating systems that incentivize efficient social learning. Agents arrive sequentially and choose actions, each of which yields a reward drawn from an unknown distribution. A policy maps the rewards of previously-chosen actions to messages for arriving agents. The regret of a policy is the difference, over all rounds, between the expected reward of the best action and the reward induced by the policy. Prior work proposes policies that recommend a single action to each agent, obtaining optimal regret under standard rationality assumptions. We instead assume a frequentist behavioral model and, accordingly, restrict attention to disclosure policies that use messages consisting of the actions and rewards from a subsequence of past agents, chosen ex ante. We design a policy with optimal regret in the worst case over reward distributions. Our research suggests three components of effective polices: independent focus groups, group aggregators, and interlaced information structures.  ( 2 min )
    Sensor-Based Estimation of Dim Light Melatonin Onset (DLMO) Using Features of Two Time Scales. (arXiv:1908.07483v5 [cs.LG] UPDATED)
    Circadian rhythms influence multiple essential biological activities including sleep, performance, and mood. The dim light melatonin onset (DLMO) is the gold standard for measuring human circadian phase (i.e., timing). The collection of DLMO is expensive and time-consuming since multiple saliva or blood samples are required overnight in special conditions, and the samples must then be assayed for melatonin. Recently, several computational approaches have been designed for estimating DLMO. These methods collect daily sampled data (e.g., sleep onset/offset times) or frequently sampled data (e.g., light exposure/skin temperature/physical activity collected every minute) to train learning models for estimating DLMO. One limitation of these studies is that they only leverage one time-scale data. We propose a two-step framework for estimating DLMO using data from both time scales. The first step summarizes data from before the current day, while the second step combines this summary with frequently sampled data of the current day. We evaluate three moving average models that input sleep timing data as the first step and use recurrent neural network models as the second step. The results using data from 207 undergraduates show that our two-step model with two time-scale features has statistically significantly lower root-mean-square errors than models that use either daily sampled data or frequently sampled data.  ( 2 min )
    Evolving Curricula with Regret-Based Environment Design. (arXiv:2203.01302v1 [cs.LG])
    It remains a significant challenge to train generally capable agents with reinforcement learning (RL). A promising avenue for improving the robustness of RL agents is through the use of curricula. One such class of methods frames environment design as a game between a student and a teacher, using regret-based objectives to produce environment instantiations (or levels) at the frontier of the student agent's capabilities. These methods benefit from their generality, with theoretical guarantees at equilibrium, yet they often struggle to find effective levels in challenging design spaces. By contrast, evolutionary approaches seek to incrementally alter environment complexity, resulting in potentially open-ended learning, but often rely on domain-specific heuristics and vast amounts of computational resources. In this paper we propose to harness the power of evolution in a principled, regret-based curriculum. Our approach, which we call Adversarially Compounding Complexity by Editing Levels (ACCEL), seeks to constantly produce levels at the frontier of an agent's capabilities, resulting in curricula that start simple but become increasingly complex. ACCEL maintains the theoretical benefits of prior regret-based methods, while providing significant empirical gains in a diverse set of environments. An interactive version of the paper is available at accelagent.github.io.  ( 2 min )
    Providing Insights for Open-Response Surveys via End-to-End Context-Aware Clustering. (arXiv:2203.01294v1 [cs.LG])
    Teachers often conduct surveys in order to collect data from a predefined group of students to gain insights into topics of interest. When analyzing surveys with open-ended textual responses, it is extremely time-consuming, labor-intensive, and difficult to manually process all the responses into an insightful and comprehensive report. In the analysis step, traditionally, the teacher has to read each of the responses and decide on how to group them in order to extract insightful information. Even though it is possible to group the responses only using certain keywords, such an approach would be limited since it not only fails to account for embedded contexts but also cannot detect polysemous words or phrases and semantics that are not expressible in single words. In this work, we present a novel end-to-end context-aware framework that extracts, aggregates, and abbreviates embedded semantic patterns in open-response survey data. Our framework relies on a pre-trained natural language model in order to encode the textual data into semantic vectors. The encoded vectors then get clustered either into an optimally tuned number of groups or into a set of groups with pre-specified titles. In the former case, the clusters are then further analyzed to extract a representative set of keywords or summary sentences that serve as the labels of the clusters. In our framework, for the designated clusters, we finally provide context-aware wordclouds that demonstrate the semantically prominent keywords within each group. Honoring user privacy, we have successfully built the on-device implementation of our framework suitable for real-time analysis on mobile devices and have tested it on a synthetic dataset. Our framework reduces the costs at-scale by automating the process of extracting the most insightful information pieces from survey data.  ( 2 min )
    Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits. (arXiv:1807.07623v6 [cs.LG] UPDATED)
    We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent (OMD) with Tsallis entropy regularization with power $\alpha=1/2$ and reduced-variance loss estimators. More generally, we define an adversarial regime with a self-bounding constraint, which includes stochastic regime, stochastically constrained adversarial regime (Wei and Luo), and stochastic regime with adversarial corruptions (Lykouris et al.) as special cases, and show that the algorithm achieves logarithmic regret guarantee in this regime and all of its special cases simultaneously with the adversarial regret guarantee.} The algorithm also achieves adversarial and stochastic optimality in the utility-based dueling bandit setting. We provide empirical evaluation of the algorithm demonstrating that it significantly outperforms UCB1 and EXP3 in stochastic environments. We also provide examples of adversarial environments, where UCB1 and Thompson Sampling exhibit almost linear regret, whereas our algorithm suffers only logarithmic regret. To the best of our knowledge, this is the first example demonstrating vulnerability of Thompson Sampling in adversarial environments. Last, but not least, we present a general stochastic analysis and a general adversarial analysis of OMD algorithms with Tsallis entropy regularization for $\alpha\in[0,1]$ and explain the reason why $\alpha=1/2$ works best.  ( 2 min )
    ADVISE: ADaptive Feature Relevance and VISual Explanations for Convolutional Neural Networks. (arXiv:2203.01289v1 [cs.CV])
    To equip Convolutional Neural Networks (CNNs) with explainability, it is essential to interpret how opaque models take specific decisions, understand what causes the errors, improve the architecture design, and identify unethical biases in the classifiers. This paper introduces ADVISE, a new explainability method that quantifies and leverages the relevance of each unit of the feature map to provide better visual explanations. To this end, we propose using adaptive bandwidth kernel density estimation to assign a relevance score to each unit of the feature map with respect to the predicted class. We also propose an evaluation protocol to quantitatively assess the visual explainability of CNN models. We extensively evaluate our idea in the image classification task using AlexNet, VGG16, ResNet50, and Xception pretrained on ImageNet. We compare ADVISE with the state-of-the-art visual explainable methods and show that the proposed method outperforms competing approaches in quantifying feature-relevance and visual explainability while maintaining competitive time complexity. Our experiments further show that ADVISE fulfils the sensitivity and implementation independence axioms while passing the sanity checks. The implementation is accessible for reproducibility purposes on https://github.com/dehshibi/ADVISE.  ( 2 min )
    Deep Temporal Interpolation of Radar-based Precipitation. (arXiv:2203.01277v1 [cs.CV])
    When providing the boundary conditions for hydrological flood models and estimating the associated risk, interpolating precipitation at very high temporal resolutions (e.g. 5 minutes) is essential not to miss the cause of flooding in local regions. In this paper, we study optical flow-based interpolation of globally available weather radar images from satellites. The proposed approach uses deep neural networks for the interpolation of multiple video frames, while terrain information is combined with temporarily coarse-grained precipitation radar observation as inputs for self-supervised training. An experiment with the Meteonet radar precipitation dataset for the flood risk simulation in Aude, a department in Southern France (2018), demonstrated the advantage of the proposed method over a linear interpolation baseline, with up to 20% error reduction.  ( 2 min )
    Machine learning models predict calculation outcomes with the transferability necessary for computational catalysis. (arXiv:2203.01276v1 [physics.chem-ph])
    Virtual high throughput screening (VHTS) and machine learning (ML) have greatly accelerated the design of single-site transition-metal catalysts. VHTS of catalysts, however, is often accompanied with high calculation failure rate and wasted computational resources due to the difficulty of simultaneously converging all mechanistically relevant reactive intermediates to expected geometries and electronic states. We demonstrate a dynamic classifier approach, i.e., a convolutional neural network that monitors geometry optimization on the fly, and exploit its good performance and transferability for catalyst design. We show that the dynamic classifier performs well on all reactive intermediates in the representative catalytic cycle of the radical rebound mechanism for methane-to-methanol despite being trained on only one reactive intermediate. The dynamic classifier also generalizes to chemically distinct intermediates and metal centers absent from the training data without loss of accuracy or model confidence. We rationalize this superior model transferability to the use of on-the-fly electronic structure and geometric information generated from density functional theory calculations and the convolutional layer in the dynamic classifier. Combined with model uncertainty quantification, the dynamic classifier saves more than half of the computational resources that would have been wasted on unsuccessful calculations for all reactive intermediates being considered.  ( 2 min )
    On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features. (arXiv:2203.01238v1 [cs.LG])
    When training deep neural networks for classification tasks, an intriguing empirical phenomenon has been widely observed in the last-layer classifiers and features, where (i) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and (ii) cross-example within-class variability of last-layer activations collapses to zero. This phenomenon is called Neural Collapse (NC), which seems to take place regardless of the choice of loss functions. In this work, we justify NC under the mean squared error (MSE) loss, where recent empirical evidence shows that it performs comparably or even better than the de-facto cross-entropy loss. Under a simplified unconstrained feature model, we provide the first global landscape analysis for vanilla nonconvex MSE loss and show that the (only!) global minimizers are neural collapse solutions, while all other critical points are strict saddles whose Hessian exhibit negative curvature directions. Furthermore, we justify the usage of rescaled MSE loss by probing the optimization landscape around the NC solutions, showing that the landscape can be improved by tuning the rescaling hyperparameters. Finally, our theoretical findings are experimentally verified on practical network architectures.  ( 2 min )
    Low-Degree Multicalibration. (arXiv:2203.01255v1 [cs.LG])
    Introduced as a notion of algorithmic fairness, multicalibration has proved to be a powerful and versatile concept with implications far beyond its original intent. This stringent notion -- that predictions be well-calibrated across a rich class of intersecting subpopulations -- provides its strong guarantees at a cost: the computational and sample complexity of learning multicalibrated predictors are high, and grow exponentially with the number of class labels. In contrast, the relaxed notion of multiaccuracy can be achieved more efficiently, yet many of the most desirable properties of multicalibration cannot be guaranteed assuming multiaccuracy alone. This tension raises a key question: Can we learn predictors with multicalibration-style guarantees at a cost commensurate with multiaccuracy? In this work, we define and initiate the study of Low-Degree Multicalibration. Low-Degree Multicalibration defines a hierarchy of increasingly-powerful multi-group fairness notions that spans multiaccuracy and the original formulation of multicalibration at the extremes. Our main technical contribution demonstrates that key properties of multicalibration, related to fairness and accuracy, actually manifest as low-degree properties. Importantly, we show that low-degree multicalibration can be significantly more efficient than full multicalibration. In the multi-class setting, the sample complexity to achieve low-degree multicalibration improves exponentially (in the number of classes) over full multicalibration. Our work presents compelling evidence that low-degree multicalibration represents a sweet spot, pairing computational and sample efficiency with strong fairness and accuracy guarantees.  ( 2 min )
    Convolutional neural networks as an alternative to Bayesian retrievals. (arXiv:2203.01236v1 [astro-ph.EP])
    Exoplanet observations are currently analysed with Bayesian retrieval techniques. Due to the computational load of the models used, a compromise is needed between model complexity and computing time. Analysis of data from future facilities, will need more complex models which will increase the computational load of retrievals, prompting the search for a faster approach for interpreting exoplanet observations. Our goal is to compare machine learning retrievals of exoplanet transmission spectra with nested sampling, and understand if machine learning can be as reliable as Bayesian retrievals for a statistically significant sample of spectra while being orders of magnitude faster. We generate grids of synthetic transmission spectra and their corresponding planetary and atmospheric parameters, one using free chemistry models, and the other using equilibrium chemistry models. Each grid is subsequently rebinned to simulate both HST/WFC3 and JWST/NIRSpec observations, yielding four datasets in total. Convolutional neural networks (CNNs) are trained with each of the datasets. We perform retrievals on a 1,000 simulated observations for each combination of model type and instrument with nested sampling and machine learning. We also use both methods to perform retrievals on real WFC3 transmission spectra. Finally, we test how robust machine learning and nested sampling are against incorrect assumptions in our models. CNNs reach a lower coefficient of determination between predicted and true values of the parameters. Nested sampling underestimates the uncertainty in ~8% of retrievals, whereas CNNs estimate them correctly. For real WFC3 observations, nested sampling and machine learning agree within $2\sigma$ for ~86% of spectra. When doing retrievals with incorrect assumptions, nested sampling underestimates the uncertainty in ~12% to ~41% of cases, whereas this is always below ~10% for the CNN.  ( 2 min )
    TAE: A Semi-supervised Controllable Behavior-aware Trajectory Generator and Predictor. (arXiv:2203.01261v1 [cs.RO])
    Trajectory generation and prediction are two interwoven tasks that play important roles in planner evaluation and decision making for intelligent vehicles. Most existing methods focus on one of the two and are optimized to directly output the final generated/predicted trajectories, which only contain limited information for critical scenario augmentation and safe planning. In this work, we propose a novel behavior-aware Trajectory Autoencoder (TAE) that explicitly models drivers' behavior such as aggressiveness and intention in the latent space, using semi-supervised adversarial autoencoder and domain knowledge in transportation. Our model addresses trajectory generation and prediction in a unified architecture and benefits both tasks: the model can generate diverse, controllable and realistic trajectories to enhance planner optimization in safety-critical and long-tailed scenarios, and it can provide prediction of critical behavior in addition to the final trajectories for decision making. Experimental results demonstrate that our method achieves promising performance on both trajectory generation and prediction.  ( 2 min )
    Interactive Visualization of Protein RINs using NetworKit in the Cloud. (arXiv:2203.01263v1 [cs.SI])
    Network analysis has been applied in diverse application domains. In this paper, we consider an example from protein dynamics, specifically residue interaction networks (RINs). In this context, we use NetworKit -- an established package for network analysis -- to build a cloud-based environment that enables domain scientists to run their visualization and analysis workflows on large compute servers, without requiring extensive programming and/or system administration knowledge. To demonstrate the versatility of this approach, we use it to build a custom Jupyter-based widget for RIN visualization. In contrast to existing RIN visualization approaches, our widget can easily be customized through simple modifications of Python code, while both supporting a good feature set and providing near real-time speed. It is also easily integrated into analysis pipelines (e.g., that use Python to feed RIN data into downstream machine learning tasks).  ( 2 min )
    Flow-based density of states for complex actions. (arXiv:2203.01243v1 [hep-lat])
    Emerging sampling algorithms based on normalizing flows have the potential to solve ergodicity problems in lattice calculations. Furthermore, it has been noted that flows can be used to compute thermodynamic quantities which are difficult to access with traditional methods. This suggests that they are also applicable to the density-of-states approach to complex action problems. In particular, flow-based sampling may be used to compute the density directly, in contradistinction to the conventional strategy of reconstructing it via measuring and integrating the derivative of its logarithm. By circumventing this procedure, the accumulation of errors from the numerical integration is avoided completely and the overall normalization factor can be determined explicitly. In this proof-of-principle study, we demonstrate our method in the context of two-component scalar field theory where the $O(2)$ symmetry is explicitly broken by an imaginary external field. First, we concentrate on the zero-dimensional case which can be solved exactly. We show that with our method, the Lee-Yang zeroes of the associated partition function can be successfully located. Subsequently, we confirm that the flow-based approach correctly reproduces the density computed with conventional methods in one- and two-dimensional models.  ( 2 min )
    Model-free Neural Lyapunov Control for Safe Robot Navigation. (arXiv:2203.01190v1 [cs.RO])
    Model-free Deep Reinforcement Learning (DRL) controllers have demonstrated promising results on various challenging non-linear control tasks. While a model-free DRL algorithm can solve unknown dynamics and high-dimensional problems, it lacks safety assurance. Although safety constraints can be encoded as part of a reward function, there still exists a large gap between an RL controller trained with this modified reward and a safe controller. In contrast, instead of implicitly encoding safety constraints with rewards, we explicitly co-learn a Twin Neural Lyapunov Function (TNLF) with the control policy in the DRL training loop and use the learned TNLF to build a runtime monitor. Combined with the path generated from a planner, the monitor chooses appropriate waypoints that guide the learned controller to provide collision-free control trajectories. Our approach inherits the scalability advantages from DRL while enhancing safety guarantees. Our experimental evaluation demonstrates the effectiveness of our approach compared to DRL with augmented rewards and constrained DRL methods over a range of high-dimensional safety-sensitive navigation tasks.  ( 2 min )
    Learning Conditional Variational Autoencoders with Missing Covariates. (arXiv:2203.01218v1 [stat.ML])
    Conditional variational autoencoders (CVAEs) are versatile deep generative models that extend the standard VAE framework by conditioning the generative model with auxiliary covariates. The original CVAE model assumes that the data samples are independent, whereas more recent conditional VAE models, such as the Gaussian process (GP) prior VAEs, can account for complex correlation structures across all data samples. While several methods have been proposed to learn standard VAEs from partially observed datasets, these methods fall short for conditional VAEs. In this work, we propose a method to learn conditional VAEs from datasets in which auxiliary covariates can contain missing values as well. The proposed method augments the conditional VAEs with a prior distribution for the missing covariates and estimates their posterior using amortised variational inference. At training time, our method marginalises the uncertainty associated with the missing covariates while simultaneously maximising the evidence lower bound. We develop computationally efficient methods to learn CVAEs and GP prior VAEs that are compatible with mini-batching. Our experiments on simulated datasets as well as on a clinical trial study show that the proposed method outperforms previous methods in learning conditional VAEs from non-temporal, temporal, and longitudinal datasets.  ( 2 min )
    Applying multi-angled parallelism to Spanish topographical maps. (arXiv:2203.01169v1 [cs.CV])
    Multi-Angled Parallelism (MAP) is a method to recognize lines in binary images. It is suitable to be implemented in parallel processing and image processing hardware. The binary image is transformed into directional planes, upon which, directional operators of erosion-dilation are iteratively applyed. From a set of basic operators, more complex ones are created, which let to extract the several types of lines. Each type is extracted with a different set of operations and so the lines are identified when extracted. In this paper, an overview of MAP is made, and it is adapted to line recognition in Spanish topographical maps, with the double purpose of testing the method in a real case and studying the process of adapting it to a custom application.  ( 2 min )
    Hybrid Model-based / Data-driven Graph Transform for Image Coding. (arXiv:2203.01186v1 [eess.IV])
    Transform coding to sparsify signal representations remains crucial in an image compression pipeline. While the Karhunen-Lo\`{e}ve transform (KLT) computed from an empirical covariance matrix $\bar{C}$ is theoretically optimal for a stationary process, in practice, collecting sufficient statistics from a non-stationary image to reliably estimate $\bar{C}$ can be difficult. In this paper, to encode an intra-prediction residual block, we pursue a hybrid model-based / data-driven approach: the first $K$ eigenvectors of a transform matrix are derived from a statistical model, e.g., the asymmetric discrete sine transform (ADST), for stability, while the remaining $N-K$ are computed from $\bar{C}$ for performance. The transform computation is posed as a graph learning problem, where we seek a graph Laplacian matrix minimizing a graphical lasso objective inside a convex cone sharing the first $K$ eigenvectors in a Hilbert space of real symmetric matrices. We efficiently solve the problem via augmented Lagrangian relaxation and proximal gradient (PG). Using WebP as a baseline image codec, experimental results show that our hybrid graph transform achieved better energy compaction than default discrete cosine transform (DCT) and better stability than KLT.  ( 2 min )
    Estimating average causal effects from patient trajectories. (arXiv:2203.01228v1 [stat.ML])
    In medical practice, treatments are selected based on the expected causal effects on patient outcomes. Here, the gold standard for estimating causal effects are randomized controlled trials; however, such trials are costly and sometimes even unethical. Instead, medical practice is increasingly interested in estimating causal effects among patient subgroups from electronic health records, that is, observational data. In this paper, we aim at estimating the average causal effect (ACE) from observational data (patient trajectories) that are collected over time. For this, we propose DeepACE: an end-to-end deep learning model. DeepACE leverages the iterative G-computation formula to adjust for the bias induced by time-varying confounders. Moreover, we develop a novel sequential targeting procedure which ensures that DeepACE has favorable theoretical properties, i.e., is doubly robust and asymptotically efficient. To the best of our knowledge, this is the first work that proposes an end-to-end deep learning model for estimating time-varying ACEs. We compare DeepACE in an extensive number of experiments, confirming that it achieves state-of-the-art performance. We further provide a case study for patients suffering from low back pain to demonstrate that DeepACE generates important and meaningful findings for clinical practice. Our work enables medical practitioners to develop effective treatment recommendations tailored to patient subgroups.  ( 2 min )
    Linear Stochastic Bandits over a Bit-Constrained Channel. (arXiv:2203.01198v1 [cs.LG])
    One of the primary challenges in large-scale distributed learning stems from stringent communication constraints. While several recent works address this challenge for static optimization problems, sequential decision-making under uncertainty has remained much less explored in this regard. Motivated by this gap, we introduce a new linear stochastic bandit formulation over a bit-constrained channel. Specifically, in our setup, an agent interacting with an environment transmits encoded estimates of an unknown model parameter to a server over a communication channel of finite capacity. The goal of the server is to take actions based on these estimates to minimize cumulative regret. To this end, we develop a novel and general algorithmic framework that hinges on two main components: (i) an adaptive encoding mechanism that exploits statistical concentration bounds, and (ii) a decision-making principle based on confidence sets that account for encoding errors. As our main result, we prove that when the unknown model is $d$-dimensional, a channel capacity of $O(d)$ bits suffices to achieve order-optimal regret. To demonstrate the generality of our approach, we then show that the same result continues to hold for non-linear observation models satisfying standard regularity conditions. Finally, we establish that for the simpler unstructured multi-armed bandit problem, $1$ bit channel-capacity is sufficient for achieving optimal regret bounds. Overall, our work takes a significant first step towards paving the way for statistical decision-making over finite-capacity channels.  ( 2 min )
    Discrete Optimal Transport with Independent Marginals is #P-Hard. (arXiv:2203.01161v1 [math.OC])
    We study the computational complexity of the optimal transport problem that evaluates the Wasserstein distance between the distributions of two K-dimensional discrete random vectors. The best known algorithms for this problem run in polynomial time in the maximum of the number of atoms of the two distributions. However, if the components of either random vector are independent, then this number can be exponential in K even though the size of the problem description scales linearly with K. We prove that the described optimal transport problem is #P-hard even if all components of the first random vector are independent uniform Bernoulli random variables, while the second random vector has merely two atoms, and even if only approximate solutions are sought. We also develop a dynamic programming-type algorithm that approximates the Wasserstein distance in pseudo-polynomial time when the components of the first random vector follow arbitrary independent discrete distributions, and we identify special problem instances that can be solved exactly in strongly polynomial time.  ( 2 min )
    Rethinking Pretraining as a Bridge from ANNs to SNNs. (arXiv:2203.01158v1 [cs.NE])
    Spiking neural networks (SNNs) are known as a typical kind of brain-inspired models with their unique features of rich neuronal dynamics, diverse coding schemes and low power consumption properties. How to obtain a high-accuracy model has always been the main challenge in the field of SNN. Currently, there are two mainstream methods, i.e., obtaining a converted SNN through converting a well-trained Artificial Neural Network (ANN) to its SNN counterpart or training an SNN directly. However, the inference time of a converted SNN is too long, while SNN training is generally very costly and inefficient. In this work, a new SNN training paradigm is proposed by combining the concepts of the two different training methods with the help of the pretrain technique and BP-based deep SNN training mechanism. We believe that the proposed paradigm is a more efficient pipeline for training SNNs. The pipeline includes pipeS for static data transfer tasks and pipeD for dynamic data transfer tasks. SOTA results are obtained in a large-scale event-driven dataset ES-ImageNet. For training acceleration, we achieve the same (or higher) best accuracy as similar LIF-SNNs using 1/10 training time on ImageNet-1K and 2/5 training time on ES-ImageNet and also provide a time-accuracy benchmark for a new dataset ES-UCF101. These experimental results reveal the similarity of the functions of parameters between ANNs and SNNs and also demonstrate the various potential applications of this SNN training pipeline.  ( 2 min )
    On the application of generative adversarial networks for nonlinear modal analysis. (arXiv:2203.01229v1 [cs.LG])
    Linear modal analysis is a useful and effective tool for the design and analysis of structures. However, a comprehensive basis for nonlinear modal analysis remains to be developed. In the current work, a machine learning scheme is proposed with a view to performing nonlinear modal analysis. The scheme is focussed on defining a one-to-one mapping from a latent `modal' space to the natural coordinate space, whilst also imposing orthogonality of the mode shapes. The mapping is achieved via the use of the recently-developed cycle-consistent generative adversarial network (cycle-GAN) and an assembly of neural networks targeted on maintaining the desired orthogonality. The method is tested on simulated data from structures with cubic nonlinearities and different numbers of degrees of freedom, and also on data from an experimental three-degree-of-freedom set-up with a column-bumper nonlinearity. The results reveal the method's efficiency in separating the `modes'. The method also provides a nonlinear superposition function, which in most cases has very good accuracy.  ( 2 min )
    Efficient Online Linear Control with Stochastic Convex Costs and Unknown Dynamics. (arXiv:2203.01170v1 [math.OC])
    We consider the problem of controlling an unknown linear dynamical system under a stochastic convex cost and full feedback of both the state and cost function. We present a computationally efficient algorithm that attains an optimal $\sqrt{T}$ regret-rate against the best stabilizing linear controller. In contrast to previous work, our algorithm is based on the Optimism in the Face of Uncertainty paradigm. This results in a substantially improved computational complexity and a simpler analysis.  ( 2 min )
    Speaker recognition improvement using blind inversion of distortions. (arXiv:2203.01164v1 [cs.SD])
    In this paper we propose the inversion of nonlinear distortions in order to improve the recognition rates of a speaker recognizer system. We study the effect of saturations on the test signals, trying to take into account real situations where the training material has been recorded in a controlled situation but the testing signals present some mismatch with the input signal level (saturations). The experimental results shows that a combination of data fusion with and without nonlinear distortion compensation can improve the recognition rates with saturated test sentences from 80% to 88.57%, while the results with clean speech (without saturation) is 87.76% for one microphone.  ( 2 min )
    Towards Efficient and Stable K-Asynchronous Federated Learning with Unbounded Stale Gradients on Non-IID Data. (arXiv:2203.01214v1 [cs.LG])
    Federated learning (FL) is an emerging privacy-preserving paradigm that enables multiple participants collaboratively to train a global model without uploading raw data. Considering heterogeneous computing and communication capabilities of different participants, asynchronous FL can avoid the stragglers effect in synchronous FL and adapts to scenarios with vast participants. Both staleness and non-IID data in asynchronous FL would reduce the model utility. However, there exists an inherent contradiction between the solutions to the two problems. That is, mitigating the staleness requires to select less but consistent gradients while coping with non-IID data demands more comprehensive gradients. To address the dilemma, this paper proposes a two-stage weighted $K$ asynchronous FL with adaptive learning rate (WKAFL). By selecting consistent gradients and adjusting learning rate adaptively, WKAFL utilizes stale gradients and mitigates the impact of non-IID data, which can achieve multifaceted enhancement in training speed, prediction accuracy and training stability. We also present the convergence analysis for WKAFL under the assumption of unbounded staleness to understand the impact of staleness and non-IID data. Experiments implemented on both benchmark and synthetic FL datasets show that WKAFL has better overall performance compared to existing algorithms.  ( 2 min )
    A Quantitative Geometric Approach to Neural Network Smoothness. (arXiv:2203.01212v1 [cs.LG])
    Fast and precise Lipschitz constant estimation of neural networks is an important task for deep learning. Researchers have recently found an intrinsic trade-off between the accuracy and smoothness of neural networks, so training a network with a loose Lipschitz constant estimation imposes a strong regularization and can hurt the model accuracy significantly. In this work, we provide a unified theoretical framework, a quantitative geometric approach, to address the Lipschitz constant estimation. By adopting this framework, we can immediately obtain several theoretical results, including the computational hardness of Lipschitz constant estimation and its approximability. Furthermore, the quantitative geometric perspective can also provide some insights into recent empirical observations that techniques for one norm do not usually transfer to another one. We also implement the algorithms induced from this quantitative geometric approach in a tool GeoLIP. These algorithms are based on semidefinite programming (SDP). Our empirical evaluation demonstrates that GeoLIP is more scalable and precise than existing tools on Lipschitz constant estimation for $\ell_\infty$-perturbations. Furthermore, we also show its intricate relations with other recent SDP-based techniques, both theoretically and empirically. We believe that this unified quantitative geometric perspective can bring new insights and theoretical tools to the investigation of neural-network smoothness and robustness.  ( 2 min )
    Are Latent Factor Regression and Sparse Regression Adequate?. (arXiv:2203.01219v1 [stat.ME])
    We propose the Factor Augmented sparse linear Regression Model (FARM) that not only encompasses both the latent factor regression and sparse linear regression as special cases but also bridges dimension reduction and sparse regression together. We provide theoretical guarantees for the estimation of our model under the existence of sub-Gaussian and heavy-tailed noises (with bounded (1+x)-th moment, for all x>0), respectively. In addition, the existing works on supervised learning often assume the latent factor regression or the sparse linear regression is the true underlying model without justifying its adequacy. To fill in such an important gap, we also leverage our model as the alternative model to test the sufficiency of the latent factor regression and the sparse linear regression models. To accomplish these goals, we propose the Factor-Adjusted de-Biased Test (FabTest) and a two-stage ANOVA type test respectively. We also conduct large-scale numerical experiments including both synthetic and FRED macroeconomics data to corroborate the theoretical properties of our methods. Numerical results illustrate the robustness and effectiveness of our model against latent factor regression and sparse linear regression models.  ( 2 min )
    A simple and universal rotation equivariant point-cloud network. (arXiv:2203.01216v1 [cs.LG])
    Equivariance to permutations and rigid motions is an important inductive bias for various 3D learning problems. Recently it has been shown that the equivariant Tensor Field Network architecture is universal -- it can approximate any equivariant function. In this paper we suggest a much simpler architecture, prove that it enjoys the same universality guarantees and evaluate its performance on Modelnet40. The code to reproduce our experiments is available at \url{https://github.com/simpleinvariance/UniversalNetwork}  ( 2 min )
    DCT-Former: Efficient Self-Attention withDiscrete Cosine Transform. (arXiv:2203.01178v1 [cs.LG])
    Since their introduction the Trasformer architectures emerged as the dominating architectures for both natural language processing and, more recently, computer vision applications. An intrinsic limitation of this family of "fully-attentive" architectures arises from the computation of the dot-product attention, which grows both in memory consumption and number of operations as $O(n^2)$ where $n$ stands for the input sequence length, thus limiting the applications that require modeling very long sequences. Several approaches have been proposed so far in the literature to mitigate this issue, with varying degrees of success. Our idea takes inspiration from the world of \textit{lossy} data compression (such as the JPEG algorithm) to derive an approximation of the attention module by leveraging the properties of the Discrete Cosine Transform. An extensive section of experiments shows that our method takes up less memory for the same performance, while also drastically reducing inference time. This makes it particularly suitable in real-time contexts on embedded platforms. Moreover, we assume that the results of our research might serve as a starting point for a broader family of deep neural models with reduced memory footprint. The implementation will be made publicly available at https://github.com/cscribano/DCT-Former-Public  ( 2 min )

  • Open

    Lessons Learned on Language Model Safety and Misuse
    The deployment of powerful AI systems has enriched our understanding of safety and misuse far more than would have been possible through research alone. Notably: API-based language model misuse often comes in different forms than we feared most. We have identified limitations in existing language model evaluations that we are  ( 9 min )
    Economic Impacts Research at OpenAI
    Call for expressions of interest to study the economic impacts of Codex.  ( 4 min )
  • Open

    Experiment tracking with reinforcement learning for robotics.
    I know there are many great experiment tracking tools for ML i.e. weights and biases, or clearML, but has anybody got any suggestions for the best way to track reinforcement learning experiments. Particularly I'm interested in RL for robotics where I regularly change the environment In an attempt to coax the desired behavior out of the robot. submitted by /u/sanddollarbeach [link] [comments]  ( 1 min )
    "Affordance Learning from Play for Sample-Efficient Policy Learning", Borja-Diaz et al 2022
    submitted by /u/gwern [link] [comments]
    Using episode returns as a metric to decide hyperparameters/weights in the loss function
    Is it cheating or peeking the answers to use evaluation episode returns to decide which hyperparameters/weights to be used? I am now working on using different weights on one term of my loss function, however, Q-value estimate is not suitable for me to decide which weight to be used, I have tried others, e.g. actor loss, critic loss, etc. The only metric I could trust is the real episode returns of these loss function s with varied weights. submitted by /u/Blasphemer666 [link] [comments]  ( 1 min )
    Uses of RL and Research in RL
    I wanted to ask, how hot reinforcement learning is in terms of research? what kind of problems does the field have currently? I'm looking for a research topic and I became interested in RL, but none of my professors seems to research RL. [untill now I was working on privacy-preserving ML in CNNs] can it be used to solve relevant problems? what are some of the challenges that needs to be solved in RL ? thank you! submitted by /u/Elegant-Base-2339 [link] [comments]  ( 1 min )
  • Open

    [R] RNN with instance segmentation
    Hi, I'm working on a MSc thesis for object detection and I want to ask if exists an equivalent of Mask R-CNN but that uses RNN. submitted by /u/Comar997 [link] [comments]
    [P] ML tools for single-cell genomics research
    https://www.immunai.com/press/advancements-in-gene-regulatory-networks-immunais-third-symposium submitted by /u/pahita [link] [comments]  ( 1 min )
    [N] Upcoming talks for Laplace’s Causal Demon: A Webinar Series about Bayesian Inference and Causal Inference (week of 7 Mar.)
    We have two talks next week (week of 7 Mar.) for our upcoming webinar series about the intersection of Bayesian inference and causal inference. Our speakers will help us understand how we can use these two frameworks in order to solve applied problems, and will consider if these different frameworks are in conflict or are complimentary. The intended audience is machine learning practitioners and statisticians from academia and industry. ​ Upcoming talks, with Zoom registration links: 7 March Andrew Gelman - Bayesian Methods in Causal Inference and Decision Making Consider the problem of A/B testing (that is, an experiment or observational study designed to estimate the effect of some exposure or treatment). The basic data analysis workflow is to start by comparing the average outcomes…  ( 1 min )
    [R] 6D Rotation Representation For Unconstrained Head Pose
    Abstract: In this paper, we present a method for unconstrained end-to-end head pose estimation. We address the problem of ambiguous rotation labels by introducing the rotation matrix formalism for our ground truth data and propose a continuous 6D rotation matrix representation for efficient and robust direct regression. This way, our method can learn the full rotation appearance which is contrary to previous approaches that restrict the pose prediction to a narrow-angle for satisfactory results. In addition, we propose a geodesic distance-based loss to penalize our network with respect to the SO(3) manifold geometry. Experiments on the public AFLW2000 and BIWI datasets demonstrate that our proposed method significantly outperforms other state-of-the-art methods by up to 20%. We open-source our training and testing code along with our pre-trained models: https://github.com/thohemp/6DRepNet. PDF on ResearchGate / Arxiv (Preprint) Testing code, training code and fine-tuned models: https://github.com/thohemp/6DRepNet https://i.redd.it/bp4kyr7vj6l81.gif submitted by /u/thohemp [link] [comments]  ( 1 min )
    [R] The Quest for a Common Model of the Intelligent Decision Maker
    submitted by /u/hardmaru [link] [comments]
    [D] What software/products does everyone use for conducting A/B tests with machine learning models?
    I.e. for testing the impact of the machine learning model on different KPIs. Not entirely sure how to get started with this... submitted by /u/Adi-Dewan [link] [comments]  ( 1 min )
    [R] DeepNet: Scaling Transformers to 1,000 Layers
    submitted by /u/nighthawk454 [link] [comments]
  • Open

    The BIG challenge
    I have a challenge 4 u if u can draw read this if u can't go away ... ... just joking he he the challenge is u have draw something with a circle there has 2 be a circle no circle does't count and are the rules are have 2 use a circle and u can do any thing to the circle it can be on paper , online it does't matter rule 2 it has to be appropriate no bad things u horny fuckers at the end this month i will pick a winner the and winner will be tag in everything I post 4 in till end of summer submitted by /u/Mrplaytime770 [link] [comments]  ( 1 min )
    Feeding The World Via Precision Agriculture, Computer Vision & Machine Learning - Dr. Gaurav Bansal, PhD - Blue River Technology, John Deere
    submitted by /u/ObjectiveGround5 [link] [comments]
    Researchers From Microsoft and CMU Introduce ‘COMPASS’: A General-Purpose Large-Scale Pretraining Pipeline For Perception-Action Loops in Autonomous Systems
    Humans have the essential cognitive capacity to comprehend the world via multimodal sensory inputs and use this ability to perform a wide range of activities. Autonomous agents can observe an environment’s underlying condition through many sensors in a comparable way and evaluate how to complete a job correctly. Localization (or “where am I?”) is a fundamental question that an autonomous agent must answer before navigation, and it is frequently addressed via visual Odometry. Collision avoidance and knowledge of the temporal evolution of their condition concerning the surroundings are required for highly dynamic tasks such as car racing. As a result, it’s probable to create general-purpose pre-trained models for autonomous systems independent of functions and form factors. In a recent paper, COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems, a general-purpose pre-training pipeline was proposed to circumvent such restrictions coming from task-specific models. COMPASS has three main features: It is a large-scale pre-training pipeline for perception-action loops in autonomous systems that may be used for various applications. COMPASS’s learned representations adapt to multiple situations and dramatically increase performance on relevant downstream tasks. It was created with multimodal data in mind. The framework is designed to use rich information from several sensor modalities, given the presence of a large number of sensors in autonomous systems. It is taught self-supervised without manual labeling, allowing it to use enormous amounts of data for pre-training. Continue reading my full summary article on this research Paper: https://www.microsoft.com/en-us/research/uploads/prod/2022/02/COMPASS.pdf Github: https://github.com/microsoft/COMPASS https://preview.redd.it/vtwwp0w4i7l81.png?width=1536&format=png&auto=webp&s=f2329c18705bcfab87a41e85ca42243489b51456 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    How the Great Automation Will Kill People
    submitted by /u/Beautiful-Credit-868 [link] [comments]  ( 1 min )
    The world is changing only faster
    submitted by /u/HumanSeeing [link] [comments]  ( 1 min )
    Stock prediction based on Twitter sentiment analysis
    If anyone here has done stock prediction based on Twitter sentiment analysis, I wonder is it doable? I saw a dataset of tweets about Apple and want to train a random forest classifier with it, but I'm not sure that it will produce good results for stock price prediction. Also, what could be a more suitable project idea for sentiment analysis? I feel like it's really hard to predict stock prices with people's opinions. I would really appreciate it if someone could share any advice on this! submitted by /u/studentani [link] [comments]  ( 2 min )
    The Execution Process of a Tensor in Deep Learning Framework
    submitted by /u/Just0by [link] [comments]
    I made an Indiegame about Artificial Intelligence, my Autistic Special interest, I hope that you will like it.
    submitted by /u/LifeSymbiont [link] [comments]
  • Open

    TRILLsson: Small, Universal Speech Representations for Paralinguistic Tasks
    Posted by Joel Shor, Staff Software Engineer, Google Research In recent years, we have seen dramatic improvements on lexical tasks such as automatic speech recognition (ASR). However, machine systems still struggle to understand paralinguistic aspects — such as tone, emotion, whether a speaker is wearing a mask, etc. Understanding these aspects represents one of the remaining difficult problems in machine hearing. In addition, state-of-the-art results often come from ultra-large models trained on private data, making them impractical to run on mobile devices or to release publicly. In “Universal Paralinguistic Speech Representations Using Self-Supervised Conformers”, to appear in ICASSP 2022, we introduce CAP12— the 12th layer of a 600M parameter model trained on the YT-U training da…  ( 8 min )
  • Open

    Can a single CNN accurately classify both ambient sounds and the emotion in the case of human voice, or should there be 3 neural networks (1 to decide whether it's human voice or another sound) and the other 2 forked?
    submitted by /u/retlaoge47 [link] [comments]  ( 1 min )
    What is wrong with my neural network implementation?
    I am trying to implement a simple neural network with a single hidden layer, but the network is not converging for xor. Please help me figure out what's wrong with my implementation. class SimpleNeuralNetwork: def __init__(self, n_in, n_hidden, n_out): self.n_in = n_in self.n_hidden = n_hidden self.n_out = n_out self.w1 = [[random.random() for _ in range(n_in)] for _ in range(n_hidden)] self.w2 = [[random.random() for _ in range(n_hidden)] for _ in range(n_out)] self.b1 = [random.random() for _ in range(n_hidden)] self.b2 = [random.random() for _ in range(n_out)] def predict(self, x): hidden = [self.b1[i] for i in range(self.n_hidden)] for i in range(self.n_hidden): for j in range(self.n_in): hidden[i] += self.w1[i][j] * x[j] hidden = [sigmoid(hidden[i]) for i in range(self.n_hidden)] out…  ( 2 min )
    Inputs - Neural network
    I am sorry if this question is too obvious. Let's say I want to train a neural network and each of the samples has 40 inputs features. Now, if 10 of these features are always the same for all the samples, should I then take them out of the feature set? Or are they still useful for the Neural Network to learn? Are there cases when these inputs help the model even when they are the same in each sample? I would appreciate it if anyone can answer this. Thank you. submitted by /u/CardiologistGlass934 [link] [comments]  ( 1 min )
    Help in understanding a few points in the article - "Weight Uncertainty in Neural Networks" - Bayes by Backprop
    Hey guys! This is my first post here :) I'm currently working on a school project which contains summarizing an article. I got most of it covered but there are some points I don't understand and a bit of math I could use some help in. The article is "Weight Uncertainty in Neural Networks" by Blundell et al. (2015). Is there anyone here familiar with this article, or similar Bayesian learning algorithms that can help me, please? Everything in this article is new material for me that I had to learn alone almost from scratch on the internet. Any help would be greatly appreciated since I don't have anyone to ask about this. Some of my questions are: At the end of section 2, after explaining MAPs, I didn't manage to do the algebra that gets us from Gaussian/Laplace prior to L2/L1 regula…  ( 2 min )
    New to data science and the teaching assistant to the class is no help. can someone help me with this?
    The momentum constant α is normally assigned a positive value in the range of (0,1). Consider a negative value of α in the range of (-1,0). Suppose the error surface of a 1-dimensional weight x is a tilted line (i.e., a decreasing function of the weight w with the same negative derivative everywhere). Derive the expression of the weight change at time step t. Compare it with the case of a positive value of α, and comment on the result. submitted by /u/halfjain [link] [comments]  ( 1 min )
  • Open

    3 Questions: Fotini Christia on racial equity and data science
    A new MIT-wide effort launched by the Institute for Data, Systems, and Society uses social science and computation to address systemic racism.  ( 5 min )
  • Open

    Enable the visually impaired to hear documents using Amazon Textract and Amazon Polly
    At the 2021 AWS re:Invent conference in Las Vegas, we demoed Read For Me at the AWS Builders Fair—a website that helps the visually impaired hear documents. For better quality, view the video here. Adaptive technology and accessibility features are often expensive, if they’re available at all. Audio books help the visually impaired read. Audio […]  ( 7 min )
  • Open

    Introduction to NFT’s
    “Nobody ever changed the world by doing what everyone else was doing” — Mark Cuban  ( 3 min )
  • Open

    GFN Thursday Marches Forward With 27 Games Coming to GeForce NOW This Month
    A new month means a whole new set of games coming to GeForce NOW. Members can look forward to 27 titles joining the GeForce NOW library in March, including day-and-date releases like Shadow Warrior 3, with support for NVIDIA DLSS. Bring a Katana to a Gunfight Shoot, slash and slide into Shadow Warrior 3, new Read article > The post GFN Thursday Marches Forward With 27 Games Coming to GeForce NOW This Month appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Implicit Bayesian Inference in Large Language Models
    This intriguing paper kept me thinking long enough for me to I decide it's time to resurrect my blogging (I started writing this during ICLR review period, and realised it might be a good idea to wait until that's concluded) Sang Michael Xie, Aditi Raghunathan, Percy  ( 4 min )
  • Open

    Continual Learning for Monolingual End-to-End Automatic Speech Recognition. (arXiv:2112.09427v2 [eess.AS] UPDATED)
    Adapting Automatic Speech Recognition (ASR) models to new domains leads to a deterioration of performance on the original domain(s), a phenomenon called Catastrophic Forgetting (CF). Even monolingual ASR models cannot be extended to new accents, dialects, topics, etc. without suffering from CF, making them unable to be continually enhanced without storing all past data. Fortunately, Continual Learning (CL) methods, which aim to enable continual adaptation while overcoming CF, can be used. In this paper, we implement an extensive number of CL methods for End-to-End ASR and test and compare their ability to extend a monolingual Hybrid CTC-Transformer model across four new tasks. We find that the best performing CL method closes the gap between the fine-tuned model (lower bound) and the model trained jointly on all tasks (upper bound) by more than 40%, while requiring access to only 0.6% of the original data.
    Background Mixup Data Augmentation for Hand and Object-in-Contact Detection. (arXiv:2202.13941v2 [cs.CV] UPDATED)
    Detecting the positions of human hands and objects-in-contact (hand-object detection) in each video frame is vital for understanding human activities from videos. For training an object detector, a method called Mixup, which overlays two training images to mitigate data bias, has been empirically shown to be effective for data augmentation. However, in hand-object detection, mixing two hand-manipulation images produces unintended biases, e.g., the concentration of hands and objects in a specific region degrades the ability of the hand-object detector to identify object boundaries. We propose a data-augmentation method called Background Mixup that leverages data-mixing regularization while reducing the unintended effects in hand-object detection. Instead of mixing two images where a hand and an object in contact appear, we mix a target training image with background images without hands and objects-in-contact extracted from external image sources, and use the mixed images for training the detector. Our experiments demonstrated that the proposed method can effectively reduce false positives and improve the performance of hand-object detection in both supervised and semi-supervised learning settings.
    Interpreting Video Engagement Prediction: A Deep Learning Framework. (arXiv:2101.01076v4 [cs.LG] UPDATED)
    Health misinformation on social media devastates physical and mental health, invalidates health gains, and potentially costs lives. Understanding how health misinformation is transmitted is an urgent goal for researchers, social media platforms, health sectors, and policymakers to mitigate those ramifications. Deep learning methods have been deployed to predict the spread of misinformation. While achieving the state-of-the-art predictive performance, deep learning methods lack the interpretability due to their blackbox nature. To remedy this gap, this study proposes a novel interpretable deep learning approach, Generative Adversarial Network based Piecewise Wide and Attention Deep Learning (GAN-PiWAD), to predict health misinformation transmission in social media. Improving upon state-of-the-art interpretable methods, GAN-PiWAD captures the interactions among multi-modal data, offers unbiased estimation of the total effect of each feature, and models the dynamic total effect of each feature when its value varies. We select features according to social exchange theory and evaluate GAN-PiWAD on 4,445 misinformation videos. The proposed approach outperformed strong benchmarks. Interpretation of GAN-PiWAD indicates video description, negative video content, and channel credibility are key features that drive viral transmission of misinformation. This study contributes to IS with a novel interpretable deep learning method that is generalizable to understand other human decision factors. Our findings provide direct implications for social media platforms and policymakers to design proactive interventions to identify misinformation, control transmissions, and manage infodemics.
    Memory Planning for Deep Neural Networks. (arXiv:2203.00448v1 [cs.LG])
    We study memory allocation patterns in DNNs during inference, in the context of large-scale systems. We observe that such memory allocation patterns, in the context of multi-threading, are subject to high latencies, due to \texttt{mutex} contention in the system memory allocator. Latencies incurred due to such \texttt{mutex} contention produce undesirable bottlenecks in user-facing services. Thus, we propose a "memorization" based technique, \texttt{MemoMalloc}, for optimizing overall latency, with only moderate increases in peak memory usage. Specifically, our technique consists of a runtime component, which captures all allocations and uniquely associates them with their high-level source operation, and a static analysis component, which constructs an efficient allocation "plan". We present an implementation of \texttt{MemoMalloc} in the PyTorch deep learning framework and evaluate memory consumption and execution performance on a wide range of DNN architectures. We find that \texttt{MemoMalloc} outperforms state-of-the-art general purpose memory allocators, with respect to DNN inference latency, by as much as 40\%.
    Last Layer Marginal Likelihood for Invariance Learning. (arXiv:2106.07512v2 [stat.ML] UPDATED)
    Data augmentation is often used to incorporate inductive biases into models. Traditionally, these are hand-crafted and tuned with cross validation. The Bayesian paradigm for model selection provides a path towards end-to-end learning of invariances using only the training data, by optimising the marginal likelihood. Computing the marginal likelihood is hard for neural networks, but success with tractable approaches that compute the marginal likelihood for the last layer only raises the question of whether this convenient approach might be employed for learning invariances. We show partial success on standard benchmarks, in the low-data regime and on a medical imaging dataset by designing a custom optimisation routine. Introducing a new lower bound to the marginal likelihood allows us to perform inference for a larger class of likelihood functions than before. On the other hand, we demonstrate failure modes on the CIFAR10 dataset, where the last layer approximation is not sufficient due to the increased complexity of our neural network. Our results indicate that once more sophisticated approximations become available the marginal likelihood is a promising approach for invariance learning in neural networks.
    Performance of Distribution Regression with Doubling Measure under the seek of Closest Point. (arXiv:2203.00155v1 [cs.LG])
    We study the distribution regression problem assuming the distribution of distributions has a doubling measure larger than one. First, we explore the geometry of any distributions that has doubling measure larger than one and build a small theory around it. Then, we show how to utilize this theory to find one of the nearest distributions adaptively and compute the regression value based on these distributions. Finally, we provide the accuracy of the suggested method here and provide the theoretical analysis for it.
    Model Error Propagation via Learned Contraction Metrics for Safe Feedback Motion Planning of Unknown Systems. (arXiv:2104.08695v2 [cs.RO] UPDATED)
    We present a method for contraction-based feedback motion planning of locally incrementally exponentially stabilizable systems with unknown dynamics that provides probabilistic safety and reachability guarantees. Given a dynamics dataset, our method learns a deep control-affine approximation of the dynamics. To find a trusted domain where this model can be used for planning, we obtain an estimate of the Lipschitz constant of the model error, which is valid with a given probability, in a region around the training data, providing a local, spatially-varying model error bound. We derive a trajectory tracking error bound for a contraction-based controller that is subjected to this model error, and then learn a controller that optimizes this tracking bound. With a given probability, we verify the correctness of the controller and tracking error bound in the trusted domain. We then use the trajectory error bound together with the trusted domain to guide a sampling-based planner to return trajectories that can be robustly tracked in execution. We show results on a 4D car, a 6D quadrotor, and a 22D deformable object manipulation task, showing our method plans safely with learned models of high-dimensional underactuated systems, while baselines that plan without considering the tracking error bound or the trusted domain can fail to stabilize the system and become unsafe.
    PaSca: a Graph Neural Architecture Search System under the Scalable Paradigm. (arXiv:2203.00638v1 [cs.LG])
    Graph neural networks (GNNs) have achieved state-of-the-art performance in various graph-based tasks. However, as mainstream GNNs are designed based on the neural message passing mechanism, they do not scale well to data size and message passing steps. Although there has been an emerging interest in the design of scalable GNNs, current researches focus on specific GNN design, rather than the general design space, limiting the discovery of potential scalable GNN models. This paper proposes PasCa, a new paradigm and system that offers a principled approach to systemically construct and explore the design space for scalable GNNs, rather than studying individual designs. Through deconstructing the message passing mechanism, PasCa presents a novel Scalable Graph Neural Architecture Paradigm (SGAP), together with a general architecture design space consisting of 150k different designs. Following the paradigm, we implement an auto-search engine that can automatically search well-performing and scalable GNN architectures to balance the trade-off between multiple criteria (e.g., accuracy and efficiency) via multi-objective optimization. Empirical studies on ten benchmark datasets demonstrate that the representative instances (i.e., PasCa-V1, V2, and V3) discovered by our system achieve consistent performance among competitive baselines. Concretely, PasCa-V3 outperforms the state-of-the-art GNN method JK-Net by 0.4\% in terms of predictive accuracy on our large industry dataset while achieving up to $28.3\times$ training speedups.
    Invariance, encodings, and generalization: learning identity effects with neural networks. (arXiv:2101.08386v4 [cs.LG] UPDATED)
    Often in language and other areas of cognition, whether two components of an object are identical or not determines if it is well formed. We call such constraints identity effects. When developing a system to learn well-formedness from examples, it is easy enough to build in an identify effect. But can identity effects be learned from the data without explicit guidance? We provide a framework in which we can rigorously prove that algorithms satisfying simple criteria cannot make the correct inference. We then show that a broad class of learning algorithms including deep feedforward neural networks trained via gradient-based algorithms (such as stochastic gradient descent or the Adam method) satisfy our criteria, dependent on the encoding of inputs. In some broader circumstances we are able to provide adversarial examples that the network necessarily classifies incorrectly. Finally, we demonstrate our theory with computational experiments in which we explore the effect of different input encodings on the ability of algorithms to generalize to novel inputs. This allows us to show similar effects to those predicted by theory for more realistic methods that violate some of the conditions of our theoretical results.
    Weighted Training for Cross-Task Learning. (arXiv:2105.14095v2 [cs.LG] UPDATED)
    In this paper, we introduce Target-Aware Weighted Training (TAWT), a weighted training algorithm for cross-task learning based on minimizing a representation-based task distance between the source and target tasks. We show that TAWT is easy to implement, is computationally efficient, requires little hyperparameter tuning, and enjoys non-asymptotic learning-theoretic guarantees. The effectiveness of TAWT is corroborated through extensive experiments with BERT on four sequence tagging tasks in natural language processing (NLP), including part-of-speech (PoS) tagging, chunking, predicate detection, and named entity recognition (NER). As a byproduct, the proposed representation-based task distance allows one to reason in a theoretically principled way about several critical aspects of cross-task learning, such as the choice of the source data and the impact of fine-tuning.
    Improved Approximation Algorithms for Individually Fair Clustering. (arXiv:2106.14043v2 [cs.DS] UPDATED)
    We consider the $k$-clustering problem with $\ell_p$-norm cost, which includes $k$-median, $k$-means and $k$-center, under an individual notion of fairness proposed by Jung et al. [2020]: given a set of points $P$ of size $n$, a set of $k$ centers induces a fair clustering if every point in $P$ has a center among its $n/k$ closest neighbors. Mahabadi and Vakilian [2020] presented a $(p^{O(p)},7)$-bicriteria approximation for fair clustering with $\ell_p$-norm cost: every point finds a center within distance at most $7$ times its distance to its $(n/k)$-th closest neighbor and the $\ell_p$-norm cost of the solution is at most $p^{O(p)}$ times the cost of an optimal fair solution. In this work, for any $\varepsilon>0$, we present an improved $(16^p +\varepsilon,3)$-bicriteria for this problem. Moreover, for $p=1$ ($k$-median) and $p=\infty$ ($k$-center), we present improved cost-approximation factors $7.081+\varepsilon$ and $3+\varepsilon$ respectively. To achieve our guarantees, we extend the framework of [Charikar et al., 2002, Swamy, 2016] and devise a $16^p$-approximation algorithm for the facility location with $\ell_p$-norm cost under matroid constraint which might be of an independent interest. Besides, our approach suggests a reduction from our individually fair clustering to a clustering with a group fairness requirement proposed by Kleindessner et al. [2019], which is essentially the median matroid problem [Krishnaswamy et al., 2011].
    Deep Adaptive Arbitrary Polynomial Chaos Expansion: A Mini-data-driven Semi-supervised Method for Uncertainty Quantification. (arXiv:2107.10428v2 [cs.LG] UPDATED)
    The surrogate model-based uncertainty quantification method has drawn much attention in many engineering fields. Polynomial chaos expansion (PCE) and deep learning (DL) are powerful methods for building a surrogate model. However, PCE needs to increase the expansion order to improve the accuracy of the surrogate model, which causes more labeled data to solve the expansion coefficients, and DL also requires a lot of labeled data to train the deep neural network (DNN). First of all, this paper proposes the adaptive arbitrary polynomial chaos (aPC) and proves two properties about the adaptive expansion coefficients. Based on the adaptive aPC, a semi-supervised deep adaptive arbitrary polynomial chaos expansion (Deep aPCE) method is proposed to reduce the training data cost and improve the surrogate model accuracy. For one hand, the Deep aPCE method uses two properties of the adaptive aPC to assist in training the DNN based on only a small amount of labeled data and many unlabeled data, significantly reducing the training data cost. On the other hand, the Deep aPCE method adopts the DNN to fine-tune the adaptive expansion coefficients dynamically, improving the Deep aPCE model accuracy with lower expansion order. Besides, the Deep aPCE method can directly construct accurate surrogate models of the high dimensional stochastic systems without complex dimension-reduction and model decomposition operations. Five numerical examples and an actual engineering problem are used to verify the effectiveness of the Deep aPCE method.
    A Dynamical Estimation and Prediction for Covid19 on Romania using ensemble neural networks. (arXiv:2203.00407v1 [cs.LG])
    In this paper, we propose an analysis of Covid19 evolution and prediction on Romania combined with the mathematical model of SIRD, an extension of the classical model SIR, which includes the deceased as a separate category. The reason is that, because we can not fully trust the reported numbers of infected or recovered people, we base our analysis on the more reliable number of deceased people. In addition, one of the parameters of our model includes the proportion of infected and tested versus infected. Since there are many factors which have an impact on the evolution of the pandemic, we decide to treat the estimation and the prediction based on the previous 7 days of data, particularly important here being the number of deceased. We perform the estimation and prediction using neural networks in two steps. Firstly, by simulating data with our model, we train several neural networks which learn the parameters of the model. Secondly, we use an ensemble of ten of these neural networks to forecast the parameters from the real data of Covid19 in Romania. Many of these results are backed up by a theorem which guarantees that we can recover the parameters from the reported data.
    Learning Intermediate Representations using Graph Neural Networks for NUMA and Prefetchers Optimization. (arXiv:2203.00611v1 [cs.DC])
    There is a large space of NUMA and hardware prefetcher configurations that can significantly impact the performance of an application. Previous studies have demonstrated how a model can automatically select configurations based on the dynamic properties of the code to achieve speedups. This paper demonstrates how the static Intermediate Representation (IR) of the code can guide NUMA/prefetcher optimizations without the prohibitive cost of performance profiling. We propose a method to create a comprehensive dataset that includes a diverse set of intermediate representations along with optimum configurations. We then apply a graph neural network model in order to validate this dataset. We show that our static intermediate representation based model achieves 80% of the performance gains provided by expensive dynamic performance profiling based strategies. We further develop a hybrid model that uses both static and dynamic information. Our hybrid model achieves the same gains as the dynamic models but at a reduced cost by only profiling 30% of the programs.
    A Neural Ordinary Differential Equation Model for Visualizing Deep Neural Network Behaviors in Multi-Parametric MRI based Glioma Segmentation. (arXiv:2203.00628v1 [q-bio.QM])
    Purpose: To develop a neural ordinary differential equation (ODE) model for visualizing deep neural network (DNN) behavior during multi-parametric MRI (mp-MRI) based glioma segmentation as a method to enhance deep learning explainability. Methods: By hypothesizing that deep feature extraction can be modeled as a spatiotemporally continuous process, we designed a novel deep learning model, neural ODE, in which deep feature extraction was governed by an ODE without explicit expression. The dynamics of 1) MR images after interactions with DNN and 2) segmentation formation can be visualized after solving ODE. An accumulative contribution curve (ACC) was designed to quantitatively evaluate the utilization of each MRI by DNN towards the final segmentation results. The proposed neural ODE model was demonstrated using 369 glioma patients with a 4-modality mp-MRI protocol: T1, contrast-enhanced T1 (T1-Ce), T2, and FLAIR. Three neural ODE models were trained to segment enhancing tumor (ET), tumor core (TC), and whole tumor (WT). The key MR modalities with significant utilization by DNN were identified based on ACC analysis. Segmentation results by DNN using only the key MR modalities were compared to the ones using all 4 MR modalities. Results: All neural ODE models successfully illustrated image dynamics as expected. ACC analysis identified T1-Ce as the only key modality in ET and TC segmentations, while both FLAIR and T2 were key modalities in WT segmentation. Compared to the U-Net results using all 4 MR modalities, Dice coefficient of ET (0.784->0.775), TC (0.760->0.758), and WT (0.841->0.837) using the key modalities only had minimal differences without significance. Conclusion: The neural ODE model offers a new tool for optimizing the deep learning model inputs with enhanced explainability. The presented methodology can be generalized to other medical image-related deep learning applications.
    MDA for random forests: inconsistency, and a practical solution via the Sobol-MDA. (arXiv:2102.13347v3 [stat.ML] UPDATED)
    Variable importance measures are the main tools to analyze the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the definition of MDA varies across the main random forest software. In this article, our objective is to rigorously analyze the behavior of the main MDA implementations. Consequently, we mathematically formalize the various implemented MDA algorithms, and then establish their limits when the sample size increases. This asymptotic analysis reveals that these MDA versions differ as importance measures, since they converge towards different quantities. More importantly, we break down these limits into three components: the first two terms are related to Sobol indices, which are well-defined measures of a covariate contribution to the response variance, widely used in the sensitivity analysis field, as opposed to the third term, whose value increases with dependence within covariates. Thus, we theoretically demonstrate that the MDA does not target the right quantity to detect influential covariates in a dependent setting, a fact that has already been noticed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-MDA, which fixes the flaws of the original MDA, and consistently estimates the accuracy decrease of the forest retrained without a given covariate, but with an efficient computational cost. The Sobol-MDA empirically outperforms its competitors on both simulated and real data for variable selection. An open source implementation in R and C++ is available online.
    Investigating Selective Prediction Approaches Across Several Tasks in IID, OOD, and Adversarial Settings. (arXiv:2203.00211v1 [cs.CL])
    In order to equip NLP systems with selective prediction capability, several task-specific approaches have been proposed. However, which approaches work best across tasks or even if they consistently outperform the simplest baseline 'MaxProb' remains to be explored. To this end, we systematically study 'selective prediction' in a large-scale setup of 17 datasets across several NLP tasks. Through comprehensive experiments under in-domain (IID), out-of-domain (OOD), and adversarial (ADV) settings, we show that despite leveraging additional resources (held-out data/computation), none of the existing approaches consistently and considerably outperforms MaxProb in all three settings. Furthermore, their performance does not translate well across tasks. For instance, Monte-Carlo Dropout outperforms all other approaches on Duplicate Detection datasets but does not fare well on NLI datasets, especially in the OOD setting. Thus, we recommend that future selective prediction approaches should be evaluated across tasks and settings for reliable estimation of their capabilities.
    Long-Tailed Classification with Gradual Balanced Loss and Adaptive Feature Generation. (arXiv:2203.00452v1 [cs.CV])
    The real-world data distribution is essentially long-tailed, which poses great challenge to the deep model. In this work, we propose a new method, Gradual Balanced Loss and Adaptive Feature Generator (GLAG) to alleviate imbalance. GLAG first learns a balanced and robust feature model with Gradual Balanced Loss, then fixes the feature model and augments the under-represented tail classes on the feature level with the knowledge from well-represented head classes. And the generated samples are mixed up with real training samples during training epochs. Gradual Balanced Loss is a general loss and it can combine with different decoupled training methods to improve the original performance. State-of-the-art results have been achieved on long-tail datasets such as CIFAR100-LT, ImageNetLT, and iNaturalist, which demonstrates the effectiveness of GLAG for long-tailed visual recognition.
    Realtime strategy for image data labelling using binary models and active sampling. (arXiv:2203.00439v1 [cs.CV])
    Machine learning (ML) and Deep Learning (DL) tasks primarily depend on data. Most of the ML and DL applications involve supervised learning which requires labelled data. In the initial phases of ML realm lack of data used to be a problem, now we are in a new era of big data. The supervised ML algorithms require data to be labelled and of good quality. Labelling task requires a large amount of money and time investment. Data labelling require a skilled person who will charge high for this task, consider the case of the medical field or the data is in bulk that requires a lot of people assigned to label it. The amount of data that is well enough for training needs to be known, money and time can not be wasted to label the whole data. This paper mainly aims to propose a strategy that helps in labelling the data along with oracle in real-time. With balancing on model contribution for labelling is 89 and 81.1 for furniture type and intel scene image data sets respectively. Further with balancing being kept off model contribution is found to be 83.47 and 78.71 for furniture type and flower data sets respectively.
    Explaining a Deep Reinforcement Learning Docking Agent Using Linear Model Trees with User Adapted Visualization. (arXiv:2203.00368v1 [cs.RO])
    Deep neural networks (DNNs) can be useful within the marine robotics field, but their utility value is restricted by their black-box nature. Explainable artificial intelligence methods attempt to understand how such black-boxes make their decisions. In this work, linear model trees (LMTs) are used to approximate the DNN controlling an autonomous surface vessel (ASV) in a simulated environment and then run in parallel with the DNN to give explanations in the form of feature attributions in real-time. How well a model can be understood depends not only on the explanation itself, but also on how well it is presented and adapted to the receiver of said explanation. Different end-users may need both different types of explanations, as well as different representations of these. The main contributions of this work are (1) significantly improving both the accuracy and the build time of a greedy approach for building LMTs by introducing ordering of features in the splitting of the tree, (2) giving an overview of the characteristics of the seafarer/operator and the developer as two different end-users of the agent and receiver of the explanations, and (3) suggesting a visualization of the docking agent, the environment, and the feature attributions given by the LMT for when the developer is the end-user of the system, and another visualization for when the seafarer or operator is the end-user, based on their different characteristics.
    Minibatch vs Local SGD for Heterogeneous Distributed Learning. (arXiv:2006.04735v5 [cs.LG] UPDATED)
    We analyze Local SGD (aka parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that, (i) Minibatch SGD (even without acceleration) dominates all existing analysis of Local SGD in this setting, (ii) accelerated Minibatch SGD is optimal when the heterogeneity is high, and (iii) present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.
    Nonlinear Kernel Support Vector Machine with 0-1 Soft Margin Loss. (arXiv:2203.00399v1 [cs.LG])
    Recent advance on linear support vector machine with the 0-1 soft margin loss ($L_{0/1}$-SVM) shows that the 0-1 loss problem can be solved directly. However, its theoretical and algorithmic requirements restrict us extending the linear solving framework to its nonlinear kernel form directly, the absence of explicit expression of Lagrangian dual function of $L_{0/1}$-SVM is one big deficiency among of them. In this paper, by applying the nonparametric representation theorem, we propose a nonlinear model for support vector machine with 0-1 soft margin loss, called $L_{0/1}$-KSVM, which cunningly involves the kernel technique into it and more importantly, follows the success on systematically solving its linear task. Its optimal condition is explored theoretically and a working set selection alternating direction method of multipliers (ADMM) algorithm is introduced to acquire its numerical solution. Moreover, we firstly present a closed-form definition to the support vector (SV) of $L_{0/1}$-KSVM. Theoretically, we prove that all SVs of $L_{0/1}$-KSVM are only located on the parallel decision surfaces. The experiment part also shows that $L_{0/1}$-KSVM has much fewer SVs, simultaneously with a decent predicting accuracy, when comparing to its linear peer $L_{0/1}$-SVM and the other six nonlinear benchmark SVM classifiers.
    Side-effects of Learning from Low Dimensional Data Embedded in an Euclidean Space. (arXiv:2203.00614v1 [cs.LG])
    The low dimensional manifold hypothesis posits that the data found in many applications, such as those involving natural images, lie (approximately) on low dimensional manifolds embedded in a high dimensional Euclidean space. In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input. However, one often needs to consider evaluating the optimized network at points outside the training distribution. This paper considers the case in which the training data is distributed in a linear subspace of $\mathbb R^d$. We derive estimates on the variation of the learning function, defined by a neural network, in the direction transversal to the subspace. We study the potential regularization effects associated with the network's depth and noise in the codimension of the data manifold. We also present additional side effects in training due to the presence of noise.
    Information Field Theory and Artificial Intelligence. (arXiv:2112.10133v2 [stat.ML] UPDATED)
    Information field theory (IFT), the information theory for fields, is a mathematical framework for signal reconstruction and non-parametric inverse problems. Artificial intelligence (AI) and machine learning (ML) aim at generating intelligent systems including such for perception, cognition, and learning. This overlaps with IFT, which is designed to address perception, reasoning, and inference tasks. Here, the relation between concepts and tools in IFT and those in AI and ML research are discussed. In the context of IFT, fields denote physical quantities that change continuously as a function of space (and time) and information theory refers to Bayesian probabilistic logic equipped with the associated entropic information measures. Reconstructing a signal with IFT is a computational problem similar to training a generative neural network (GNN) in ML. In this paper, the process of inference in IFT is reformulated in terms of GNN training. In contrast to classical neural networks, IFT based GNNs can operate without pre-training thanks to incorporating expert knowledge into their architecture. Furthermore, the cross-fertilization of variational inference methods used in IFT and ML are discussed. These discussions suggests that IFT is well suited to address many problems in AI and ML research and application.
    Gait Events Prediction using Hybrid CNN-RNN-based Deep Learning models through a Single Waist-worn Wearable Sensor. (arXiv:2203.00503v1 [eess.SP])
    Elderly gait is a source of rich information about their physical and mental health condition. As an alternative to the multiple sensors on the lower body parts, a single sensor on the pelvis has a positional advantage and an abundance of information acquirable. This study aimed to explore a way of improving the accuracy of gait event detection in the elderly using a single sensor on the waist and deep learning models. Data was gathered from elderly subjects equipped with three IMU sensors while they walked. The input was taken only from the waist sensor was used to train 16 deep-learning models including CNN, RNN, and CNN-RNN hybrid with or without the Bidirectional and Attention mechanism. The groundtruth was extracted from foot IMU sensors. Fairly high accuracy of 99.73% and 93.89% was achieved by the CNN-BiGRU-Att model at the tolerance window of $\pm$6TS ($\pm$6ms) and $\pm$1TS ($\pm$1ms) respectively. Advancing from the previous studies exploring gait event detection, the model showed a great improvement in terms of its prediction error having an MAE of 6.239ms and 5.24ms for HS and TO events respectively at the tolerance window of $\pm$1TS. The results showed that the use of CNN-RNN hybrid models with Attention and Bidirectional mechanisms is promising for accurate gait event detection using a single waist sensor. The study can contribute to reducing the burden of gait detection and increase its applicability in future wearable devices that can be used for remote health monitoring (RHM) or diagnosis based thereon.
    UniTTS: Residual Learning of Unified Embedding Space for Speech Style Control. (arXiv:2106.11171v3 [eess.AS] UPDATED)
    We propose a novel high-fidelity expressive speech synthesis model, UniTTS, that learns and controls overlapping style attributes avoiding interference. UniTTS represents multiple style attributes in a single unified embedding space by the residuals between the phoneme embeddings before and after applying the attributes. The proposed method is especially effective in controlling multiple attributes that are difficult to separate cleanly, such as speaker ID and emotion, because it minimizes redundancy when adding variance in speaker ID and emotion, and additionally, predicts duration, pitch, and energy based on the speaker ID and emotion. In experiments, the visualization results exhibit that the proposed methods learned multiple attributes harmoniously in a manner that can be easily separated again. As well, UniTTS synthesized high-fidelity speech signals controlling multiple style attributes. The synthesized speech samples are presented at https://anonymous-authors2022.github.io/paper_works/UniTTS/demos/.
    EfficientWord-Net: An Open Source Hotword Detection Engine based on One-shot Learning. (arXiv:2111.00379v2 [cs.CL] UPDATED)
    Voice assistants like Siri, Google Assistant, Alexa etc. are used widely across the globe for home automation, these require the use of special phrases also known as hotwords to wake it up and perform an action like "Hey Alexa!", "Ok Google!" and "Hey Siri!" etc. These hotwords are detected with lightweight real-time engines whose purpose is to detect the hotwords uttered by the user. This paper presents the design and implementation of a hotword detection engine based on one-shot learning which detects the hotword uttered by the user in real-time with just one or few training samples of the hotword. This approach is efficient when compared to existing implementations because the process of adding a new hotword in the existing systems requires enormous amounts of positive and negative training samples and the model needs to retrain for every hotword. This makes the existing implementations inefficient in terms of computation and cost. The architecture proposed in this paper has achieved an accuracy of 94.51%.
    A Discrete-event-based Simulator for Distributed Deep Learning. (arXiv:2112.00952v2 [cs.LG] UPDATED)
    New intelligence applications are driving increasing interest in deploying deep neural networks (DNN) in a distributed way. To set up distributed deep learning involves alterations of a great number of the parameter configurations of network/edge devices and DNN models, which are crucial to achieve best performances. Simulations measure scalability of intelligence applications in the early stage, as well as to determine the effects of different configurations, thus highly desired. However, work on simulating the distributed intelligence environment is still in its infancy. The existing simulation frameworks, such as NS-3, etc., cannot extended in a straightforward way to support simulations of distributed learning. In this paper, we propose a novel discrete event simulator, sim4DistrDL, which includes a deep learning module and a network simulation module to facilitate simulation of DNN-based distributed applications. Specifically, we give the design and implementation of the proposed learning simulator and present an illustrative use case.
    Setting Fair Incentives to Maximize Improvement. (arXiv:2203.00134v1 [cs.GT])
    We consider the problem of helping agents improve by setting short-term goals. Given a set of target skill levels, we assume each agent will try to improve from their initial skill level to the closest target level within reach or do nothing if no target level is within reach. We consider two models: the common improvement capacity model, where agents have the same limit on how much they can improve, and the individualized improvement capacity model, where agents have individualized limits. Our goal is to optimize the target levels for social welfare and fairness objectives, where social welfare is defined as the total amount of improvement, and fairness objectives are considered where the agents belong to different underlying populations. A key technical challenge of this problem is the non-monotonicity of social welfare in the set of target levels, i.e., adding a new target level may decrease the total amount of improvement as it may get easier for some agents to improve. This is especially challenging when considering multiple groups because optimizing target levels in isolation for each group and outputting the union may result in arbitrarily low improvement for a group, failing the fairness objective. Considering these properties, we provide algorithms for optimal and near-optimal improvement for both social welfare and fairness objectives. These algorithmic results work for both the common and individualized improvement capacity models. Furthermore, we show a placement of target levels exists that is approximately optimal for the social welfare of each group. Unlike the algorithmic results, this structural statement only holds in the common improvement capacity model, and we show counterexamples in the individualized improvement capacity model. Finally, we extend our algorithms to learning settings where we have only sample access to the initial skill levels of agents.
    FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control. (arXiv:2201.10936v3 [cs.SD] UPDATED)
    Generating music with deep neural networks has been an area of active research in recent years. While the quality of generated samples has been steadily increasing, most methods are only able to exert minimal control over the generated sequence, if any. We propose the self-supervised description-to-sequence task, which allows for fine-grained controllable generation on a global level. We do so by extracting high-level features about the target sequence and learning the conditional distribution of sequences given the corresponding high-level description in a sequence-to-sequence modelling setup. We train FIGARO (FIne-grained music Generation via Attention-based, RObust control) by applying description-to-sequence modelling to symbolic music. By combining learned high level features with domain knowledge, which acts as a strong inductive bias, the model achieves state-of-the-art results in controllable symbolic music generation and generalizes well beyond the training distribution.
    Disentangled Sequence Clustering for Human Intention Inference. (arXiv:2101.09500v3 [cs.RO] UPDATED)
    Equipping robots with the ability to infer human intent is a vital precondition for effective collaboration. Most computational approaches towards this objective derive a probabilistic distribution of "intent" conditioned on the robot's perceived sensory state. However, these approaches typically assume task-specific labels of human intent are known a priori. To overcome this constraint, we propose the Disentangled Sequence Clustering Variational Autoencoder (DiSCVAE), a clustering framework capable of learning such a distribution of intent in an unsupervised manner. The DiSCVAE leverages recent advances in unsupervised learning to recover a disentangled latent representation of sequential data, separating time-varying local features from time-invariant global aspects. Though unlike previous frameworks for disentanglement, the proposed variant also infers a discrete variable to form a latent mixture model and enable clustering of global sequence concepts, e.g. intentions from observed human behaviour. To evaluate the DiSCVAE, we first validate its capacity to discover classes from unlabelled sequences using video datasets of bouncing digits and 2D animations. We then report results from a real-world human-robot interaction experiment conducted on a robotic wheelchair. Our findings reveal that the inferred discrete variable coincides with human intent and can thus improve assistance in collaborative settings, such as shared control.
    NeuRecover: Regression-Controlled Repair of Deep Neural Networks with Training History. (arXiv:2203.00191v1 [cs.LG])
    Systematic techniques to improve quality of deep neural networks (DNNs) are critical given the increasing demand for practical applications including safety-critical ones. The key challenge comes from the little controllability in updating DNNs. Retraining to fix some behavior often has a destructive impact on other behavior, causing regressions, i.e., the updated DNN fails with inputs correctly handled by the original one. This problem is crucial when engineers are required to investigate failures in intensive assurance activities for safety or trust. Search-based repair techniques for DNNs have potentials to tackle this challenge by enabling localized updates only on "responsible parameters" inside the DNN. However, the potentials have not been explored to realize sufficient controllability to suppress regressions in DNN repair tasks. In this paper, we propose a novel DNN repair method that makes use of the training history for judging which DNN parameters should be changed or not to suppress regressions. We implemented the method into a tool called NeuRecover and evaluated it with three datasets. Our method outperformed the existing method by achieving often less than a quarter, even a tenth in some cases, number of regressions. Our method is especially effective when the repair requirements are tight to fix specific failure types. In such cases, our method showed stably low rates (<2%) of regressions, which were in many cases a tenth of regressions caused by retraining.
    Short-term passenger flow prediction for multi-traffic modes: A residual network and Transformer based multi-task learning method. (arXiv:2203.00422v1 [cs.LG])
    With the prevailing of mobility as a service (MaaS), it becomes increasingly important to manage multi-traffic modes simultaneously and cooperatively. As an important component of MaaS, short-term passenger flow prediction for multi-traffic modes has thus been brought into focus. It is a challenging problem because the spatiotemporal features of multi-traffic modes are critically complex. To solve the problem, this paper proposes a multi-task learning-based model, called Res-Transformer, for short-term passenger flow prediction of multi-traffic modes (subway, taxi, and bus). Each traffic mode is treated as a single task in the model. The Res-Transformer consists of three parts: (1) several modified transformer layers comprising 2D convolutional neural networks (CNN) and multi-head attention mechanism, which helps to extract the spatial and temporal features of multi-traffic modes, (2) a residual network architecture used to extract the inner pattern of different traffic modes and enhance the passenger flow features of multi-traffic modes. The Res-Transformer model is evaluated on two large-scale real-world datasets from Beijing, China. One is the region of a traffic hub and the other is the region of a residential area. Experiments are conducted to compare the performance of the proposed model with several state-of-the-art models to prove the effectiveness and robustness of the proposed method. This paper can give critical insights into the short-tern passenger flow prediction for multi-traffic modes.
    Augmenting Reinforcement Learning with Behavior Primitives for Diverse Manipulation Tasks. (arXiv:2110.03655v2 [cs.LG] UPDATED)
    Realistic manipulation tasks require a robot to interact with an environment with a prolonged sequence of motor actions. While deep reinforcement learning methods have recently emerged as a promising paradigm for automating manipulation behaviors, they usually fall short in long-horizon tasks due to the exploration burden. This work introduces Manipulation Primitive-augmented reinforcement Learning (MAPLE), a learning framework that augments standard reinforcement learning algorithms with a pre-defined library of behavior primitives. These behavior primitives are robust functional modules specialized in achieving manipulation goals, such as grasping and pushing. To use these heterogeneous primitives, we develop a hierarchical policy that involves the primitives and instantiates their executions with input parameters. We demonstrate that MAPLE outperforms baseline approaches by a significant margin on a suite of simulated manipulation tasks. We also quantify the compositional structure of the learned behaviors and highlight our method's ability to transfer policies to new task variants and to physical hardware. Videos and code are available at https://ut-austin-rpl.github.io/maple
    DAIR: Data Augmented Invariant Regularization. (arXiv:2110.11205v2 [cs.LG] UPDATED)
    While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM generalizes poorly to distribution shift. This is partly explained by overfitting to spurious features such as background in images or named entities in natural language. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to remedy this problem. In addition, consistency regularization could be applied to further promote model performance to be consistent on the augmented sample and the original one. In this paper, we propose data augmented invariant regularization (DAIR), a simple form of consistency regularization that is applied directly on the loss function rather than intermediate features, making it widely applicable regardless of network architecture or problem setup. We apply DAIR to multiple real-world learning problems, namely robust regression, visual question answering, robust deep neural network training, and neural task-oriented dialog modeling. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal cost and sets new state-of-the-art results in several benchmarks.
    Proceedings of the Artificial Intelligence for Cyber Security (AICS) Workshop at AAAI 2022. (arXiv:2202.14010v2 [cs.CR] UPDATED)
    The workshop will focus on the application of AI to problems in cyber security. Cyber systems generate large volumes of data, utilizing this effectively is beyond human capabilities. Additionally, adversaries continue to develop new attacks. Hence, AI methods are required to understand and protect the cyber domain. These challenges are widely studied in enterprise networks, but there are many gaps in research and practice as well as novel problems in other domains. In general, AI techniques are still not widely adopted in the real world. Reasons include: (1) a lack of certification of AI for security, (2) a lack of formal study of the implications of practical constraints (e.g., power, memory, storage) for AI systems in the cyber domain, (3) known vulnerabilities such as evasion, poisoning attacks, (4) lack of meaningful explanations for security analysts, and (5) lack of analyst trust in AI solutions. There is a need for the research community to develop novel solutions for these practical issues.
    An Explainable AI System for the Diagnosis of High Dimensional Biomedical Data. (arXiv:2107.01820v2 [cs.LG] UPDATED)
    Typical state of the art flow cytometry data samples consists of measures of more than 100.000 cells in 10 or more features. AI systems are able to diagnose such data with almost the same accuracy as human experts. However, there is one central challenge in such systems: their decisions have far-reaching consequences for the health and life of people, and therefore, the decisions of AI systems need to be understandable and justifiable by humans. In this work, we present a novel explainable AI method, called ALPODS, which is able to classify (diagnose) cases based on clusters, i.e., subpopulations, in the high-dimensional data. ALPODS is able to explain its decisions in a form that is understandable for human experts. For the identified subpopulations, fuzzy reasoning rules expressed in the typical language of domain experts are generated. A visualization method based on these rules allows human experts to understand the reasoning used by the AI system. A comparison to a selection of state of the art explainable AI systems shows that ALPODS operates efficiently on known benchmark data and also on everyday routine case data.
    Deep Camera Pose Regression Using Pseudo-LiDAR. (arXiv:2203.00080v1 [cs.CV])
    An accurate and robust large-scale localization system is an integral component for active areas of research such as autonomous vehicles and augmented reality. To this end, many learning algorithms have been proposed that predict 6DOF camera pose from RGB or RGB-D images. However, previous methods that incorporate depth typically treat the data the same way as RGB images, often adding depth maps as additional channels to RGB images and passing them through convolutional neural networks (CNNs). In this paper, we show that converting depth maps into pseudo-LiDAR signals, previously shown to be useful for 3D object detection, is a better representation for camera localization tasks by projecting point clouds that can accurately determine 6DOF camera pose. This is demonstrated by first comparing localization accuracies of a network operating exclusively on pseudo-LiDAR representations, with networks operating exclusively on depth maps. We then propose FusionLoc, a novel architecture that uses pseudo-LiDAR to regress a 6DOF camera pose. FusionLoc is a dual stream neural network, which aims to remedy common issues with typical 2D CNNs operating on RGB-D images. The results from this architecture are compared against various other state-of-the-art deep pose regression implementations using the 7 Scenes dataset. The findings are that FusionLoc performs better than a number of other camera localization methods, with a notable improvement being, on average, 0.33m and 4.35{\deg} more accurate than RGB-D PoseNet. By proving the validity of using pseudo-LiDAR signals over depth maps for localization, there are new considerations when implementing large-scale localization systems.
    Learning Low-Dimensional Nonlinear Structures from High-Dimensional Noisy Data: An Integral Operator Approach. (arXiv:2203.00126v1 [stat.ML])
    We propose a kernel-spectral embedding algorithm for learning low-dimensional nonlinear structures from high-dimensional and noisy observations, where the datasets are assumed to be sampled from an intrinsically low-dimensional manifold and corrupted by high-dimensional noise. The algorithm employs an adaptive bandwidth selection procedure which does not rely on prior knowledge of the underlying manifold. The obtained low-dimensional embeddings can be further utilized for downstream purposes such as data visualization, clustering and prediction. Our method is theoretically justified and practically interpretable. Specifically, we establish the convergence of the final embeddings to their noiseless counterparts when the dimension and size of the samples are comparably large, and characterize the effect of the signal-to-noise ratio on the rate of convergence and phase transition. We also prove convergence of the embeddings to the eigenfunctions of an integral operator defined by the kernel map of some reproducing kernel Hilbert space capturing the underlying nonlinear structures. Numerical simulations and analysis of three real datasets show the superior empirical performance of the proposed method, compared to many existing methods, on learning various manifolds in diverse applications.
    Semi-relaxed Gromov-Wasserstein divergence with applications on graphs. (arXiv:2110.02753v3 [cs.LG] UPDATED)
    Comparing structured objects such as graphs is a fundamental operation involved in many learning tasks. To this end, the Gromov-Wasserstein (GW) distance, based on Optimal Transport (OT), has proven to be successful in handling the specific nature of the associated objects. More specifically, through the nodes connectivity relations, GW operates on graphs, seen as probability measures over specific spaces. At the core of OT is the idea of conservation of mass, which imposes a coupling between all the nodes from the two considered graphs. We argue in this paper that this property can be detrimental for tasks such as graph dictionary or partition learning, and we relax it by proposing a new semi-relaxed Gromov-Wasserstein divergence. Aside from immediate computational benefits, we discuss its properties, and show that it can lead to an efficient graph dictionary learning algorithm. We empirically demonstrate its relevance for complex tasks on graphs such as partitioning, clustering and completion.
    Gaussian Process Constraint Learning for Scalable Chance-Constrained Motion Planning from Demonstrations. (arXiv:2112.04612v2 [cs.RO] UPDATED)
    We propose a method for learning constraints represented as Gaussian processes (GPs) from locally-optimal demonstrations. Our approach uses the Karush-Kuhn-Tucker (KKT) optimality conditions to determine where on the demonstrations the constraint is tight, and a scaling of the constraint gradient at those states. We then train a GP representation of the constraint which is consistent with and which generalizes this information. We further show that the GP uncertainty can be used within a kinodynamic RRT to plan probabilistically-safe trajectories, and that we can exploit the GP structure within the planner to exactly achieve a specified safety probability. We demonstrate our method can learn complex, nonlinear constraints demonstrated on a 5D nonholonomic car, a 12D quadrotor, and a 3-link planar arm, all while requiring minimal prior information on the constraint. Our results suggest the learned GP constraint is accurate, outperforming previous constraint learning methods that require more a priori knowledge.
    Do Transformers use variable binding?. (arXiv:2203.00162v1 [cs.LG])
    Increasing the explainability of deep neural networks (DNNs) requires evaluating whether they implement symbolic computation. One central symbolic capacity is variable binding: linking an input value to an abstract variable held in system-internal memory. Prior work on the computational abilities of DNNs has not resolved the question of whether their internal processes involve variable binding. We argue that the reason for this is fundamental, inherent in the way experiments in prior work were designed. We provide the first systematic evaluation of the variable binding capacities of the state-of-the-art Transformer networks BERT and RoBERTa. Our experiments are designed such that the model must generalize a rule across disjoint subsets of the input vocabulary, and cannot rely on associative pattern matching alone. The results show a clear discrepancy between classification and sequence-to-sequence tasks: BERT and RoBERTa can easily learn to copy or reverse strings even when trained on task-specific vocabularies that are switched in the test set; but both models completely fail to generalize across vocabularies in similar sequence classification tasks. These findings indicate that the effectiveness of Transformers in sequence modelling may lie in their extensive use of the input itself as an external "memory" rather than network-internal symbolic operations involving variable binding. Therefore, we propose a novel direction for future work: augmenting the inputs available to circumvent the lack of network-internal variable binding.
    Analysis of Digitalized ECG Signals Based on Artificial Intelligence and Spectral Analysis Methods Specialized in ARVC. (arXiv:2203.00504v1 [eess.SP])
    Arrhythmogenic right ventricular cardiomyopathy (ARVC) is an inherited heart muscle disease that appears between the second and forth decade of a patient's life, being responsible for 20% of sudden cardiac deaths before the age of 35. The effective and punctual diagnosis of this disease based on Electrocardiograms (ECGs) could have a vital role in reducing premature cardiovascular mortality. In our analysis, we firstly outline the digitalization process of paper-based ECG signals enhanced by a spatial filter aiming to eliminate dark regions in the dataset's images that do not correspond to ECG waveform, producing undesirable noise. Next, we propose the utilization of a low-complexity convolutional neural network for the detection of an arrhythmogenic heart disease, that has not been studied through the usage of deep learning methodology to date, achieving high classification accuracy on a disease the major identification criterion of which are infinitesimal millivolt variations in the ECG's morphology, in contrast with other arrhythmogenic abnormalities. Finally, by performing spectral analysis we investigate significant differentiations in the field of frequencies between normal ECGs and ECGs corresponding to patients suffering from ARVC. The overall research carried out in this article highlights the importance of integrating mathematical methods into the examination and effective diagnosis of various diseases, aiming to a substantial contribution to their successful treatment.
    ONBRA: Rigorous Estimation of the Temporal Betweenness Centrality in Temporal Networks. (arXiv:2203.00653v1 [cs.SI])
    In network analysis, the betweenness centrality of a node informally captures the fraction of shortest paths visiting that node. The computation of the betweenness centrality measure is a fundamental task in the analysis of modern networks, enabling the identification of the most central nodes in such networks. Additionally to being massive, modern networks also contain information about the time at which their events occur. Such networks are often called temporal networks. The temporal information makes the study of the betweenness centrality in temporal networks (i.e., temporal betweenness centrality) much more challenging than in static networks (i.e., networks without temporal information). Moreover, the exact computation of the temporal betweenness centrality is often impractical on even moderately-sized networks, given its extremely high computational cost. A natural approach to reduce such computational cost is to obtain high-quality estimates of the exact values of the temporal betweenness centrality. In this work we present ONBRA, the first sampling-based approximation algorithm for estimating the temporal betweenness centrality values of the nodes in a temporal network, providing rigorous probabilistic guarantees on the quality of its output. ONBRA is able to compute the estimates of the temporal betweenness centrality values under two different optimality criteria for the shortest paths of the temporal network. In addition, ONBRA outputs high-quality estimates with sharp theoretical guarantees leveraging on the \emph{empirical Bernstein bound}, an advanced concentration inequality. Finally, our experimental evaluation shows that ONBRA significantly reduces the computational resources required by the exact computation of the temporal betweenness centrality on several real world networks, while reporting high-quality estimates with rigorous guarantees.
    TCMI: a non-parametric mutual-dependence estimator for multivariate continuous distributions. (arXiv:2001.11212v2 [stat.ML] UPDATED)
    The identification of relevant features, i.e., the driving variables that determine a process or the properties of a system, is an essential part of the analysis of data sets with a large number of variables. A mathematical rigorous approach to quantifying the relevance of these features is mutual information. Mutual information determines the relevance of features in terms of their joint mutual dependence to the property of interest. However, mutual information requires as input probability distributions, which cannot be reliably estimated from continuous distributions such as physical quantities like lengths or energies. Here, we introduce total cumulative mutual information (TCMI), a measure of the relevance of mutual dependences that extends mutual information to random variables of continuous distribution based on cumulative probability distributions. TCMI is a non-parametric, robust, and deterministic measure that facilitates comparisons and rankings between feature sets with different cardinality. The ranking induced by TCMI allows for feature selection, i.e., the identification of variable sets that are nonlinear statistically related to a property of interest, taking into account the number of data samples as well as the cardinality of the set of variables. We evaluate the performance of our measure with simulated data, compare its performance with similar multivariate-dependence measures, and demonstrate the effectiveness of our feature-selection method on a set of standard data sets and a typical scenario in materials science.
    Preemptive Motion Planning for Human-to-Robot Indirect Placement Handovers. (arXiv:2203.00156v1 [cs.RO])
    As technology advances, the need for safe, efficient, and collaborative human-robot-teams has become increasingly important. One of the most fundamental collaborative tasks in any setting is the object handover. Human-to-robot handovers can take either of two approaches: (1) direct hand-to-hand or (2) indirect hand-to-placement-to-pick-up. The latter approach ensures minimal contact between the human and robot but can also result in increased idle time due to having to wait for the object to first be placed down on a surface. To minimize such idle time, the robot must preemptively predict the human intent of where the object will be placed. Furthermore, for the robot to preemptively act in any sort of productive manner, predictions and motion planning must occur in real-time. We introduce a novel prediction-planning pipeline that allows the robot to preemptively move towards the human agent's intended placement location using gaze and gestures as model inputs. In this paper, we investigate the performance and drawbacks of our early intent predictor-planner as well as the practical benefits of using such a pipeline through a human-robot case study.
    Validate on Sim, Detect on Real -- Model Selection for Domain Randomization. (arXiv:2111.00765v3 [cs.RO] UPDATED)
    A practical approach to learning robot skills, often termed sim2real, is to train control policies in simulation and then deploy them on a real robot. Popular techniques to improve the sim2real transfer build on domain randomization (DR) -- training the policy on a diverse set of randomly generated domains with the hope of better generalization to the real world. Due to the large number of hyper-parameters in both the policy learning and DR algorithms, one often ends up with a large number of trained policies, where choosing the best policy among them demands costly evaluation on the real robot. In this work we ask - can we rank the policies without running them in the real world? Our main idea is that a predefined set of real world data can be used to evaluate all policies, using out-of-distribution detection (OOD) techniques. In a sense, this approach can be seen as a `unit test' to evaluate policies before any real world execution. However, we find that by itself, the OOD score can be inaccurate and very sensitive to the particular OOD method. Our main contribution is a simple-yet-effective policy score that combines OOD with an evaluation in simulation. We show that our score - VSDR - can significantly improve the accuracy of policy ranking without requiring additional real world data. We evaluate the effectiveness of VSDR on sim2real transfer in a robotic grasping task with image inputs. We extensively evaluate different DR parameters and OOD methods, and show that VSDR improves policy selection across the board. More importantly, our method achieves significantly better ranking, and uses significantly less data compared to baselines. Project website is available at https://sites.google.com/view/vsdr/home.
    Causal Incremental Graph Convolution for Recommender System Retraining. (arXiv:2108.06889v2 [cs.LG] UPDATED)
    Real-world recommender system needs to be regularly retrained to keep with the new data. In this work, we consider how to efficiently retrain graph convolution network (GCN) based recommender models, which are state-of-the-art techniques for collaborative recommendation. To pursue high efficiency, we set the target as using only new data for model updating, meanwhile not sacrificing the recommendation accuracy compared with full model retraining. This is non-trivial to achieve, since the interaction data participates in both the graph structure for model construction and the loss function for model learning, whereas the old graph structure is not allowed to use in model updating. Towards the goal, we propose a \textit{Causal Incremental Graph Convolution} approach, which consists of two new operators named \textit{Incremental Graph Convolution} (IGC) and \textit{Colliding Effect Distillation} (CED) to estimate the output of full graph convolution. In particular, we devise simple and effective modules for IGC to ingeniously combine the old representations and the incremental graph and effectively fuse the long-term and short-term preference signals. CED aims to avoid the out-of-date issue of inactive nodes that are not in the incremental graph, which connects the new data with inactive nodes through causal inference. In particular, CED estimates the causal effect of new data on the representation of inactive nodes through the control of their collider. Extensive experiments on three real-world datasets demonstrate both accuracy gains and significant speed-ups over the existing retraining mechanism.
    Sample Complexity versus Depth: An Information Theoretic Analysis. (arXiv:2203.00246v1 [cs.LG])
    Deep learning has proven effective across a range of data sets. In light of this, a natural inquiry is: "for what data generating processes can deep learning succeed?" In this work, we study the sample complexity of learning multilayer data generating processes of a sort for which deep neural networks seem to be suited. We develop general and elegant information-theoretic tools that accommodate analysis of any data generating process -- shallow or deep, parametric or nonparametric, noiseless or noisy. We then use these tools to characterize the dependence of sample complexity on the depth of multilayer processes. Our results indicate roughly linear dependence on depth. This is in contrast to previous results that suggest exponential or high-order polynomial dependence.
    Dual Embodied-Symbolic Concept Representations for Deep Learning. (arXiv:2203.00600v1 [cs.LG])
    Motivated by recent findings from cognitive neural science, we advocate the use of a dual-level model for concept representations: the embodied level consists of concept-oriented feature representations, and the symbolic level consists of concept graphs. Embodied concept representations are modality specific and exist in the form of feature vectors in a feature space. Symbolic concept representations, on the other hand, are amodal and language specific, and exist in the form of word / knowledge-graph embeddings in a concept / knowledge space. The human conceptual system comprises both embodied representations and symbolic representations, which typically interact to drive conceptual processing. As such, we further advocate the use of dual embodied-symbolic concept representations for deep learning. To demonstrate their usage and value, we discuss two important use cases: embodied-symbolic knowledge distillation for few-shot class incremental learning, and embodied-symbolic fused representation for image-text matching. Dual embodied-symbolic concept representations are the foundation for deep learning and symbolic AI integration. We discuss two important examples of such integration: scene graph generation with knowledge graph bridging, and multimodal knowledge graphs.
    Distributed randomized Kaczmarz for the adversarial workers. (arXiv:2203.00095v1 [cs.LG])
    Developing large-scale distributed methods that are robust to the presence of adversarial or corrupted workers is an important part of making such methods practical for real-world problems. Here, we propose an iterative approach that is adversary-tolerant for least-squares problems. The algorithm utilizes simple statistics to guarantee convergence and is capable of learning the adversarial distributions. Additionally, the efficiency of the proposed method is shown in simulations in the presence of adversaries. The results demonstrate the great capability of such methods to tolerate different levels of adversary rates and to identify the erroneous workers with high accuracy.
    Towards IID representation learning and its application on biomedical data. (arXiv:2203.00332v1 [cs.LG])
    Due to the heterogeneity of real-world data, the widely accepted independent and identically distributed (IID) assumption has been criticized in recent studies on causality. In this paper, we argue that instead of being a questionable assumption, IID is a fundamental task-relevant property that needs to be learned. Consider $k$ independent random vectors $\mathsf{X}^{i = 1, \ldots, k}$, we elaborate on how a variety of different causal questions can be reformulated to learning a task-relevant function $\phi$ that induces IID among $\mathsf{Z}^i := \phi \circ \mathsf{X}^i$, which we term IID representation learning. For proof of concept, we examine the IID representation learning on Out-of-Distribution (OOD) generalization tasks. Concretely, by utilizing the representation obtained via the learned function that induces IID, we conduct prediction of molecular characteristics (molecular prediction) on two biomedical datasets with real-world distribution shifts introduced by a) preanalytical variation and b) sampling protocol. To enable reproducibility and for comparison to the state-of-the-art (SOTA) methods, this is done by following the OOD benchmarking guidelines recommended from WILDS. Compared to the SOTA baselines supported in WILDS, the results confirm the superior performance of IID representation learning on OOD tasks. The code is publicly accessible via https://github.com/CTPLab/IID_representation_learning.
    Adversarial Attack Framework on Graph Embedding Models with Limited Knowledge. (arXiv:2105.12419v2 [cs.LG] UPDATED)
    With the success of the graph embedding model in both academic and industry areas, the robustness of graph embedding against adversarial attack inevitably becomes a crucial problem in graph learning. Existing works usually perform the attack in a white-box fashion: they need to access the predictions/labels to construct their adversarial loss. However, the inaccessibility of predictions/labels makes the white-box attack impractical to a real graph learning system. This paper promotes current frameworks in a more general and flexible sense -- we demand to attack various kinds of graph embedding models with black-box driven. We investigate the theoretical connections between graph signal processing and graph embedding models and formulate the graph embedding model as a general graph signal process with a corresponding graph filter. Therefore, we design a generalized adversarial attacker: GF-Attack. Without accessing any labels and model predictions, GF-Attack can perform the attack directly on the graph filter in a black-box fashion. We further prove that GF-Attack can perform an effective attack without knowing the number of layers of graph embedding models. To validate the generalization of GF-Attack, we construct the attacker on four popular graph embedding models. Extensive experiments validate the effectiveness of GF-Attack on several benchmark datasets.
    Towards Targeted Change Detection with Heterogeneous Remote Sensing Images for Forest Mortality Mapping. (arXiv:2203.00049v1 [cs.CV])
    In this paper we develop a method for mapping forest mortality in the forest-tundra ecotone using satellite data from heterogeneous sensors. We use medium resolution imagery in order to provide the complex pattern of forest mortality in this sparsely forested area, which has been induced by an outbreak of geometrid moths. Specifically, Landsat-5 Thematic Mapper images from before the event are used, with RADARSAT-2 providing the post-event images. We obtain the difference images for both multispectral optical and synthetic aperture radar (SAR) by using a recently developed deep learning method for translating between the two domains. These differences are stacked with the original pre- and post-event images in order to let our algorithm also learn how the areas appear before and after the change event. By doing this, and focusing on learning only the changes of interest with one-class classification (OCC), we obtain good results with very little training data.
    Attention-based Contextual Multi-View Graph Convolutional Networks for Short-term Population Prediction. (arXiv:2203.00489v1 [cs.LG])
    Short-term future population prediction is a crucial problem in urban computing. Accurate future population prediction can provide rich insights for urban planners or developers. However, predicting the future population is a challenging task due to its complex spatiotemporal dependencies. Many existing works have attempted to capture spatial correlations by partitioning a city into grids and using Convolutional Neural Networks (CNN). However, CNN merely captures spatial correlations by using a rectangle filter; it ignores urban environmental information such as distribution of railroads and location of POI. Moreover, the importance of those kinds of information for population prediction differs in each region and is affected by contextual situations such as weather conditions and day of the week. To tackle this problem, we propose a novel deep learning model called Attention-based Contextual Multi-View Graph Convolutional Networks (ACMV-GCNs). We first construct multiple graphs based on urban environmental information, and then ACMV-GCNs captures spatial correlations from various views with graph convolutional networks. Further, we add an attention module to consider the contextual situations when leveraging urban environmental information for future population prediction. Using statistics population count data collected through mobile phones, we demonstrate that our proposed model outperforms baseline methods. In addition, by visualizing weights calculated by an attention module, we show that our model learns an efficient way to utilize urban environment information without any prior knowledge.
    Layer Adaptive Deep Neural Networks for Out-of-distribution Detection. (arXiv:2203.00192v1 [cs.LG])
    During the forward pass of Deep Neural Networks (DNNs), inputs gradually transformed from low-level features to high-level conceptual labels. While features at different layers could summarize the important factors of the inputs at varying levels, modern out-of-distribution (OOD) detection methods mostly focus on utilizing their ending layer features. In this paper, we proposed a novel layer-adaptive OOD detection framework (LA-OOD) for DNNs that can fully utilize the intermediate layers' outputs. Specifically, instead of training a unified OOD detector at a fixed ending layer, we train multiple One-Class SVM OOD detectors simultaneously at the intermediate layers to exploit the full spectrum characteristics encoded at varying depths of DNNs. We develop a simple yet effective layer-adaptive policy to identify the best layer for detecting each potential OOD example. LA-OOD can be applied to any existing DNNs and does not require access to OOD samples during the training. Using three DNNs of varying depth and architectures, our experiments demonstrate that LA-OOD is robust against OODs of varying complexity and can outperform state-of-the-art competitors by a large margin on some real-world datasets.
    Don't miss the Mismatch: Investigating the Objective Function Mismatch for Unsupervised Representation Learning. (arXiv:2009.02383v2 [cs.CV] UPDATED)
    Finding general evaluation metrics for unsupervised representation learning techniques is a challenging open research question, which recently has become more and more necessary due to the increasing interest in unsupervised methods. Even though these methods promise beneficial representation characteristics, most approaches currently suffer from the objective function mismatch. This mismatch states that the performance on a desired target task can decrease when the unsupervised pretext task is learned too long - especially when both tasks are ill-posed. In this work, we build upon the widely used linear evaluation protocol and define new general evaluation metrics to quantitatively capture the objective function mismatch and the more generic metrics mismatch. We discuss the usability and stability of our protocols on a variety of pretext and target tasks and study mismatches in a wide range of experiments. Thereby we disclose dependencies of the objective function mismatch across several pretext and target tasks with respect to the pretext model's representation size, target model complexity, pretext and target augmentations as well as pretext and target task types. In our experiments, we find that the objective function mismatch reduces performance by ~0.1-5.0% for Cifar10, Cifar100 and PCam in many setups, and up to ~25-59% in extreme cases for the 3dshapes dataset.
    Disentangled Spatiotemporal Graph Generative Models. (arXiv:2203.00411v1 [cs.LG])
    Spatiotemporal graph represents a crucial data structure where the nodes and edges are embedded in a geometric space and can evolve dynamically over time. Nowadays, spatiotemporal graph data is becoming increasingly popular and important, ranging from microscale (e.g. protein folding), to middle-scale (e.g. dynamic functional connectivity), to macro-scale (e.g. human mobility network). Although disentangling and understanding the correlations among spatial, temporal, and graph aspects have been a long-standing key topic in network science, they typically rely on network processing hypothesized by human knowledge. This usually fit well towards the graph properties which can be predefined, but cannot do well for the most cases, especially for many key domains where the human has yet very limited knowledge such as protein folding and biological neuronal networks. In this paper, we aim at pushing forward the modeling and understanding of spatiotemporal graphs via new disentangled deep generative models. Specifically, a new Bayesian model is proposed that factorizes spatiotemporal graphs into spatial, temporal, and graph factors as well as the factors that explain the interplay among them. A variational objective function and new mutual information thresholding algorithms driven by information bottleneck theory have been proposed to maximize the disentanglement among the factors with theoretical guarantees. Qualitative and quantitative experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed model over the state-of-the-arts by up to 69.2% for graph generation and 41.5% for interpretability.
    Non-linear manifold ROM with Convolutional Autoencoders and Reduced Over-Collocation method. (arXiv:2203.00360v1 [math.NA])
    Non-affine parametric dependencies, nonlinearities and advection-dominated regimes of the model of interest can result in a slow Kolmogorov n-width decay, which precludes the realization of efficient reduced-order models based on linear subspace approximations. Among the possible solutions, there are purely data-driven methods that leverage autoencoders and their variants to learn a latent representation of the dynamical system, and then evolve it in time with another architecture. Despite their success in many applications where standard linear techniques fail, more has to be done to increase the interpretability of the results, especially outside the training range and not in regimes characterized by an abundance of data. Not to mention that none of the knowledge on the physics of the model is exploited during the predictive phase. In order to overcome these weaknesses, we implement the non-linear manifold method introduced by Carlberg et al [37] with hyper-reduction achieved through reduced over-collocation and teacher-student training of a reduced decoder. We test the methodology on a 2d non-linear conservation law and a 2d shallow water models, and compare the results obtained with a purely data-driven method for which the dynamics is evolved in time with a long-short term memory network.
    On the Generalization of Representations in Reinforcement Learning. (arXiv:2203.00543v1 [cs.LG])
    In reinforcement learning, state representations are used to tractably deal with large problem spaces. State representations serve both to approximate the value function with few parameters, but also to generalize to newly encountered states. Their features may be learned implicitly (as part of a neural network) or explicitly (for example, the successor representation of \citet{dayan1993improving}). While the approximation properties of representations are reasonably well-understood, a precise characterization of how and when these representations generalize is lacking. In this work, we address this gap and provide an informative bound on the generalization error arising from a specific state representation. This bound is based on the notion of effective dimension which measures the degree to which knowing the value at one state informs the value at other states. Our bound applies to any state representation and quantifies the natural tension between representations that generalize well and those that approximate well. We complement our theoretical results with an empirical survey of classic representation learning methods from the literature and results on the Arcade Learning Environment, and find that the generalization behaviour of learned representations is well-explained by their effective dimension.
    Beyond Ans\"atze: Learning Quantum Circuits as Unitary Operators. (arXiv:2203.00601v1 [quant-ph])
    This paper explores the advantages of optimizing quantum circuits on $N$ wires as operators in the unitary group $U(2^N)$. We run gradient-based optimization in the Lie algebra $\mathfrak u(2^N)$ and use the exponential map to parametrize unitary matrices. We argue that $U(2^N)$ is not only more general than the search space induced by an ansatz, but in ways easier to work with on classical computers. The resulting approach is quick, ansatz-free and provides an upper bound on performance over all ans\"atze on $N$ wires.
    Pairwise Supervision Can Provably Elicit a Decision Boundary. (arXiv:2006.06207v2 [stat.ML] UPDATED)
    Similarity learning is a general problem to elicit useful representations by predicting the relationship between a pair of patterns. This problem is related to various important preprocessing tasks such as metric learning, kernel learning, and contrastive learning. A classifier built upon the representations is expected to perform well in downstream classification; however, little theory has been given in literature so far and thereby the relationship between similarity and classification has remained elusive. Therefore, we tackle a fundamental question: can similarity information provably leads a model to perform well in downstream classification? In this paper, we reveal that a product-type formulation of similarity learning is strongly related to an objective of binary classification. We further show that these two different problems are explicitly connected by an excess risk bound. Consequently, our results elucidate that similarity learning is capable of solving binary classification by directly eliciting a decision boundary.
    A Recurrent Differentiable Engine for Modeling Tensegrity Robots Trainable with Low-Frequency Data. (arXiv:2203.00041v1 [cs.RO])
    Tensegrity robots, composed of rigid rods and flexible cables, are difficult to accurately model and control given the presence of complex dynamics and high number of DoFs. Differentiable physics engines have been recently proposed as a data-driven approach for model identification of such complex robotic systems. These engines are often executed at a high-frequency to achieve accurate simulation. Ground truth trajectories for training differentiable engines, however, are not typically available at such high frequencies due to limitations of real-world sensors. The present work focuses on this frequency mismatch, which impacts the modeling accuracy. We proposed a recurrent structure for a differentiable physics engine of tensegrity robots, which can be trained effectively even with low-frequency trajectories. To train this new recurrent engine in a robust way, this work introduces relative to prior work: (i) a new implicit integration scheme, (ii) a progressive training pipeline, and (iii) a differentiable collision checker. A model of NASA's icosahedron SUPERballBot on MuJoCo is used as the ground truth system to collect training data. Simulated experiments show that once the recurrent differentiable engine has been trained given the low-frequency trajectories from MuJoCo, it is able to match the behavior of MuJoCo's system. The criterion for success is whether a locomotion strategy learned using the differentiable engine can be transferred back to the ground-truth system and result in a similar motion. Notably, the amount of ground truth data needed to train the differentiable engine, such that the policy is transferable to the ground truth system, is 1% of the data needed to train the policy directly on the ground-truth system.
    Estimating causal effects with optimization-based methods: A review and empirical comparison. (arXiv:2203.00097v1 [stat.ME])
    In the absence of randomized controlled and natural experiments, it is necessary to balance the distributions of (observable) covariates of the treated and control groups in order to obtain an unbiased estimate of a causal effect of interest; otherwise, a different effect size may be estimated, and incorrect recommendations may be given. To achieve this balance, there exist a wide variety of methods. In particular, several methods based on optimization models have been recently proposed in the causal inference literature. While these optimization-based methods empirically showed an improvement over a limited number of other causal inference methods in their relative ability to balance the distributions of covariates and to estimate causal effects, they have not been thoroughly compared to each other and to other noteworthy causal inference methods. In addition, we believe that there exist several unaddressed opportunities that operational researchers could contribute with their advanced knowledge of optimization, for the benefits of the applied researchers that use causal inference tools. In this review paper, we present an overview of the causal inference literature and describe in more detail the optimization-based causal inference methods, provide a comparative analysis of the prevailing optimization-based methods, and discuss opportunities for new methods.
    Identifying charge density and dielectric environment of graphene using Raman spectroscopy and deep learning. (arXiv:2203.00431v1 [cs.LG])
    The impact of the environment on graphene's properties such as strain, charge density, and dielectric environment can be evaluated by Raman spectroscopy. These environmental interactions are not trivial to determine, since they affect the spectra in overlapping ways. Data preprocessing such as background subtraction and peak fitting is typically used. Moreover, collected spectroscopic data vary due to different experimental setups and environments. Such variations, artifacts, and environmental differences pose a challenge in accurate spectral analysis. In this work, we developed a deep learning model to overcome the effects of such variations and classify graphene Raman spectra according to different charge densities and dielectric environments. We consider two approaches: deep learning models and machine learning algorithms to classify spectra with slightly different charge density or dielectric environment. These two approaches show similar success rates for high Signal-to-Noise data. However, deep learning models are less sensitive to noise. To improve the accuracy and generalization of all models, we use data augmentation through additive noise and peak shifting. We demonstrated the spectra classification with 99% accuracy using a convolutional neural net (CNN) model. The CNN model is able to classify Raman spectra of graphene with different charge doping levels and even subtle variation in the spectra between graphene on SiO$_2$ and graphene on silanized SiO$_2$. Our approach has the potential for fast and reliable estimation of graphene doping levels and dielectric environments. The proposed model paves the way for achieving efficient analytical tools to evaluate the properties of graphene.
    DreamingV2: Reinforcement Learning with Discrete World Models without Reconstruction. (arXiv:2203.00494v1 [cs.LG])
    The present paper proposes a novel reinforcement learning method with world models, DreamingV2, a collaborative extension of DreamerV2 and Dreaming. DreamerV2 is a cutting-edge model-based reinforcement learning from pixels that uses discrete world models to represent latent states with categorical variables. Dreaming is also a form of reinforcement learning from pixels that attempts to avoid the autoencoding process in general world model training by involving a reconstruction-free contrastive learning objective. The proposed DreamingV2 is a novel approach of adopting both the discrete representation of DreamingV2 and the reconstruction-free objective of Dreaming. Compared to DreamerV2 and other recent model-based methods without reconstruction, DreamingV2 achieves the best scores on five simulated challenging 3D robot arm tasks. We believe that DreamingV2 will be a reliable solution for robot learning since its discrete representation is suitable to describe discontinuous environments, and the reconstruction-free fashion well manages complex vision observations.
    Optimal quantum dataset for learning a unitary transformation. (arXiv:2203.00546v1 [quant-ph])
    Unitary transformations formulate the time evolution of quantum states. How to learn a unitary transformation efficiently is a fundamental problem in quantum machine learning. The most natural and leading strategy is to train a quantum machine learning model based on a quantum dataset. Although presence of more training data results in better models, using too much data reduces the efficiency of training. In this work, we solve the problem on the minimum size of sufficient quantum datasets for learning a unitary transformation exactly, which reveals the power and limitation of quantum data. First, we prove that the minimum size of dataset with pure states is $2^n$ for learning an $n$-qubit unitary transformation. To fully explore the capability of quantum data, we introduce a quantum dataset consisting of $n+1$ mixed states that are sufficient for exact training. The main idea is to simplify the structure utilizing decoupling, which leads to an exponential improvement on the size over the datasets with pure states. Furthermore, we show that the size of quantum dataset with mixed states can be reduced to a constant, which yields an optimal quantum dataset for learning a unitary. We showcase the applications of our results in oracle compiling and Hamiltonian simulation. Notably, to accurately simulate a 3-qubit one-dimensional nearest-neighbor Heisenberg model, our circuit only uses $48$ elementary quantum gates, which is significantly less than $4320$ gates in the circuit constructed by the Trotter-Suzuki product formula.
    Bayesian Optimisation for Robust Model Predictive Control under Model Parameter Uncertainty. (arXiv:2203.00551v1 [cs.RO])
    We propose an adaptive optimisation approach for tuning stochastic model predictive control (MPC) hyper-parameters while jointly estimating probability distributions of the transition model parameters based on performance rewards. In particular, we develop a Bayesian optimisation (BO) algorithm with a heteroscedastic noise model to deal with varying noise across the MPC hyper-parameter and dynamics model parameter spaces. Typical homoscedastic noise models are unrealistic for tuning MPC since stochastic controllers are inherently noisy, and the level of noise is affected by their hyper-parameter settings. We evaluate the proposed optimisation algorithm in simulated control and robotics tasks where we jointly infer control and dynamics parameters. Experimental results demonstrate that our approach leads to higher cumulative rewards and more stable controllers.
    Variational Autoencoders Without the Variation. (arXiv:2203.00645v1 [cs.LG])
    Variational autoencdoers (VAE) are a popular approach to generative modelling. However, exploiting the capabilities of VAEs in practice can be difficult. Recent work on regularised and entropic autoencoders have begun to explore the potential, for generative modelling, of removing the variational approach and returning to the classic deterministic autoencoder (DAE) with additional novel regularisation methods. In this paper we empirically explore the capability of DAEs for image generation without additional novel methods and the effect of the implicit regularisation and smoothness of large networks. We find that DAEs can be used successfully for image generation without additional loss terms, and that many of the useful properties of VAEs can arise implicitly from sufficiently large convolutional encoders and decoders when trained on CIFAR-10 and CelebA.
    Provable Lifelong Learning of Representations. (arXiv:2110.14098v2 [cs.LG] UPDATED)
    In lifelong learning, tasks (or classes) to be learned arrive sequentially over time in arbitrary order. During training, knowledge from previous tasks can be captured and transferred to subsequent ones to improve sample efficiency. We consider the setting where all target tasks can be represented in the span of a small number of unknown linear or nonlinear features of the input data. We propose a lifelong learning algorithm that maintains and refines the internal feature representation. We prove that for any desired accuracy on all tasks, the dimension of the representation remains close to that of the underlying representation. The resulting sample complexity improves significantly on existing bounds. In the setting of linear features, our algorithm is provably efficient and the sample complexity for input dimension $d$, $m$ tasks with $k$ features up to error $\epsilon$ is $\tilde{O}(dk^{1.5}/\epsilon+km/\epsilon)$. We also prove a matching lower bound for any lifelong learning algorithm that uses a single task learner as a black box. We complement our analysis with an empirical study, including a heuristic lifelong learning algorithm for deep neural networks. Our method performs favorably on challenging realistic image datasets compared to state-of-the-art continual learning methods.
    Learning to Navigate Intersections with Unsupervised Driver Trait Inference. (arXiv:2109.06783v2 [cs.RO] UPDATED)
    Navigation through uncontrolled intersections is one of the key challenges for autonomous vehicles. Identifying the subtle differences in hidden traits of other drivers can bring significant benefits when navigating in such environments. We propose an unsupervised method for inferring driver traits such as driving styles from observed vehicle trajectories. We use a variational autoencoder with recurrent neural networks to learn a latent representation of traits without any ground truth trait labels. Then, we use this trait representation to learn a policy for an autonomous vehicle to navigate through a T-intersection with deep reinforcement learning. Our pipeline enables the autonomous vehicle to adjust its actions when dealing with drivers of different traits to ensure safety and efficiency. Our method demonstrates promising performance and outperforms state-of-the-art baselines in the T-intersection scenario.
    Automatic Depression Detection via Learning and Fusing Features from Visual Cues. (arXiv:2203.00304v1 [cs.LG])
    Depression is one of the most prevalent mental disorders, which seriously affects one's life. Traditional depression diagnostics commonly depends on rating with scales, which can be labor-intensive and subjective. In this context, Automatic Depression Detection (ADD) has been attracting more attention for its low cost and objectivity. ADD systems are able to detect depression automatically from some medical records, like video sequences. However, it remains challenging to effectively extract depression-specific information from long sequences, thereby hindering a satisfying accuracy. In this paper, we propose a novel ADD method via learning and fusing features from visual cues. Specifically, we firstly construct Temporal Dilated Convolutional Network (TDCN), in which multiple Dilated Convolution Blocks (DCB) are designed and stacked, to learn the long-range temporal information from sequences. Then, the Feature-Wise Attention (FWA) module is adopted to fuse different features extracted from TDCNs. The module learns to assign weights for the feature channels, aiming to better incorporate different kinds of visual features and further enhance the detection accuracy. Our method achieves the state-of-the-art performance on the DAIC_WOZ dataset compared to other visual-feature-based methods, showing its effectiveness.
    Multi-Objective Latent Space Optimization of Generative Molecular Design Models. (arXiv:2203.00526v1 [cs.LG])
    Molecular design based on generative models, such as variational autoencoders (VAEs), has become increasingly popular in recent years due to its efficiency for exploring high-dimensional molecular space to identify molecules with desired properties. While the efficacy of the initial model strongly depends on the training data, the sampling efficiency of the model for suggesting novel molecules with enhanced properties can be further enhanced via latent space optimization. In this paper, we propose a multi-objective latent space optimization (LSO) method that can significantly enhance the performance of generative molecular design (GMD). The proposed method adopts an iterative weighted retraining approach, where the respective weights of the molecules in the training data are determined by their Pareto efficiency. We demonstrate that our multi-objective GMD LSO method can significantly improve the performance of GMD for jointly optimizing multiple molecular properties.
    On genetic programming representations and fitness functions for interpretable dimensionality reduction. (arXiv:2203.00528v1 [cs.NE])
    Dimensionality reduction (DR) is an important technique for data exploration and knowledge discovery. However, most of the main DR methods are either linear (e.g., PCA), do not provide an explicit mapping between the original data and its lower-dimensional representation (e.g., MDS, t-SNE, isomap), or produce mappings that cannot be easily interpreted (e.g., kernel PCA, neural-based autoencoder). Recently, genetic programming (GP) has been used to evolve interpretable DR mappings in the form of symbolic expressions. There exists a number of ways in which GP can be used to this end and no study exists that performs a comparison. In this paper, we fill this gap by comparing existing GP methods as well as devising new ones. We evaluate our methods on several benchmark datasets based on predictive accuracy and on how well the original features can be reconstructed using the lower-dimensional representation only. Finally, we qualitatively assess the resulting expressions and their complexity. We find that various GP methods can be competitive with state-of-the-art DR algorithms and that they have the potential to produce interpretable DR mappings.
    A Deep Bayesian Neural Network for Cardiac Arrhythmia Classification with Rejection from ECG Recordings. (arXiv:2203.00512v1 [eess.SP])
    With the development of deep learning-based methods, automated classification of electrocardiograms (ECGs) has recently gained much attention. Although the effectiveness of deep neural networks has been encouraging, the lack of information given by the outputs restricts clinicians' reexamination. If the uncertainty estimation comes along with the classification results, cardiologists can pay more attention to "uncertain" cases. Our study aims to classify ECGs with rejection based on data uncertainty and model uncertainty. We perform experiments on a real-world 12-lead ECG dataset. First, we estimate uncertainties using the Monte Carlo dropout for each classification prediction, based on our Bayesian neural network. Then, we accept predictions with uncertainty under a given threshold and provide "uncertain" cases for clinicians. Furthermore, we perform a simulation experiment using varying thresholds. Finally, with the help of a clinician, we conduct case studies to explain the results of large uncertainties and incorrect predictions with small uncertainties. The results show that correct predictions are more likely to have smaller uncertainties, and the performance on accepted predictions improves as the accepting ratio decreases (i.e. more rejections). Case studies also help explain why rejection can improve the performance. Our study helps neural networks produce more accurate results and provide information on uncertainties to better assist clinicians in the diagnosis process. It can also enable deep-learning-based ECG interpretation in clinical implementation.
    Physics-Informed Neural Networks for Quantum Eigenvalue Problems. (arXiv:2203.00451v1 [cs.LG])
    Eigenvalue problems are critical to several fields of science and engineering. We expand on the method of using unsupervised neural networks for discovering eigenfunctions and eigenvalues for differential eigenvalue problems. The obtained solutions are given in an analytical and differentiable form that identically satisfies the desired boundary conditions. The network optimization is data-free and depends solely on the predictions of the neural network. We introduce two physics-informed loss functions. The first, called ortho-loss, motivates the network to discover pair-wise orthogonal eigenfunctions. The second loss term, called norm-loss, requests the discovery of normalized eigenfunctions and is used to avoid trivial solutions. We find that embedding even or odd symmetries to the neural network architecture further improves the convergence for relevant problems. Lastly, a patience condition can be used to automatically recognize eigenfunction solutions. This proposed unsupervised learning method is used to solve the finite well, multiple finite wells, and hydrogen atom eigenvalue quantum problems.
    Smoothed Differential Privacy. (arXiv:2107.01559v3 [cs.CR] UPDATED)
    Differential privacy (DP) is a widely-accepted and widely-applied notion of privacy based on worst-case analysis. Often, DP classifies most mechanisms without external noise as non-private [Dwork et al., 2014], and external noises, such as Gaussian noise or Laplacian noise [Dwork et al., 2006], are introduced to improve privacy. In many real-world applications, however, adding external noise is undesirable and sometimes prohibited. For example, presidential elections often require a deterministic rule to be used [Liu et al., 2020], and small noises can lead to dramatic decreases in the prediction accuracy of deep neural networks, especially the underrepresented classes [Bagdasaryan et al., 2019]. In this paper, we propose a natural extension and relaxation of DP following the worst average-case idea behind the celebrated smoothed analysis [Spielman and Teng, 2004]. Our notion, the smoothed DP, can effectively measure the privacy leakage of mechanisms without external noises under realistic settings. We prove several strong properties of the smoothed DP, including composability, robustness to post-processing and etc. We proved that any discrete mechanism with sampling procedures is more private than what DP predicts. In comparison, many continuous mechanisms with sampling procedures are still non-private under smoothed DP. Experimentally, we first verified that the discrete sampling mechanisms are private in real-world elections. Then, we apply the smoothed DP notion on quantized gradient descent, which indicates some neural networks can be private without adding any extra noises. We believe that these results contribute to the theoretical foundation of realistic privacy measures beyond worst-case analysis.
    DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest Neighbor Search. (arXiv:2107.02736v2 [cs.DS] UPDATED)
    Kernel Density Estimation (KDE) is a nonparametric method for estimating the shape of a density function, given a set of samples from the distribution. Recently, locality-sensitive hashing, originally proposed as a tool for nearest neighbor search, has been shown to enable fast KDE data structures. However, these approaches do not take advantage of the many other advances that have been made in algorithms for nearest neighbor algorithms. We present an algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN) where we apply Approximate Nearest Neighbor (ANN) algorithms as a black box subroutine to compute an unbiased KDE. The idea is to find points that have a large contribution to the KDE using ANN, compute their contribution exactly, and approximate the remainder with Random Sampling (RS). We present a theoretical argument that supports the idea that an ANN subroutine can speed up the evaluation. Furthermore, we provide a C++ implementation with a Python interface that can make use of an arbitrary ANN implementation as a subroutine for kernel density estimation. We show empirically that our implementation outperforms state of the art implementations in all high dimensional datasets we considered, and matches the performance of RS in cases where the ANN yield no gains in performance.
    SPEAR : Semi-supervised Data Programming in Python. (arXiv:2108.00373v2 [cs.LG] UPDATED)
    We present \spear, an open-source python library for data programming with semi supervision. The package implements several recent data programming approaches including facility to programmatically label and build training data. SPEAR facilitates \textit{weak supervision} in the form of heuristics (or rules) and association of \textit{noisy} labels to the training dataset. These \textit{noisy} labels are aggregated to assign labels to the unlabeled data for downstream tasks. We have implemented several label aggregation approaches that aggregate the \textit{noisy} labels and then train using the \textit{noisily} labeled set in a cascaded manner. Our implementation also includes other approaches that \textit{jointly} aggregate and train the model for text classification tasks. Thus, in our python package, we integrate several cascade and joint data-programming approaches while also providing the facility of data programming by letting the user define labeling functions or rules. The code and tutorial notebooks are available at \url{https://github.com/decile-team/spear}. Further, extensive documentation can be found at \url{https://spear-decile.readthedocs.io/}. Video tutorials demonstrating the usage of our package are available at https://youtu.be/SN9YYK4FlU0, https://www.youtube.com/watch?v=qdukvO3B8YU, and https://drive.google.com/uc?export=download&confirm=KY-u&id=1P0iOGHrIR1Te0sSeCB3hwdPpjzB44K33. We also present some real-world use cases of \spear.
    Causal Structure Learning with Greedy Unconditional Equivalence Search. (arXiv:2203.00521v1 [stat.ML])
    We consider the problem of characterizing directed acyclic graph (DAG) models up to unconditional equivalence, i.e., when two DAGs have the same set of unconditional d-separation statements. Each unconditional equivalence class (UEC) can be uniquely represented with an undirected graph whose clique structure encodes the members of the class. Via this structure, we provide a transformational characterization of unconditional equivalence. Combining these results, we introduce a hybrid algorithm for learning DAG models from observational data, called Greedy Unconditional Equivalence Search (GUES), which first estimates the UEC of the data using independence tests and then greedily searches the UEC for the optimal DAG. Applying GUES on synthetic data, we show that it achieves comparable accuracy to existing methods. However, in contrast to existing methods, since the average UEC is observed to contain few DAGs, the search space for GUES is drastically reduced.
    Machine Learning for Particle Flow Reconstruction at CMS. (arXiv:2203.00330v1 [physics.data-an])
    We provide details on the implementation of a machine-learning based particle flow algorithm for CMS. The standard particle flow algorithm reconstructs stable particles based on calorimeter clusters and tracks to provide a global event reconstruction that exploits the combined information of multiple detector subsystems, leading to strong improvements for quantities such as jets and missing transverse energy. We have studied a possible evolution of particle flow towards heterogeneous computing platforms such as GPUs using a graph neural network. The machine-learned PF model reconstructs particle candidates based on the full list of tracks and calorimeter clusters in the event. For validation, we determine the physics performance directly in the CMS software framework when the proposed algorithm is interfaced with the offline reconstruction of jets and missing transverse energy. We also report the computational performance of the algorithm, which scales approximately linearly in runtime and memory usage with the input size.
    Particle-based Fast Jet Simulation at the LHC with Variational Autoencoders. (arXiv:2203.00520v1 [physics.comp-ph])
    We study how to use Deep Variational Autoencoders for a fast simulation of jets of particles at the LHC. We represent jets as a list of constituents, characterized by their momenta. Starting from a simulation of the jet before detector effects, we train a Deep Variational Autoencoder to return the corresponding list of constituents after detection. Doing so, we bypass both the time-consuming detector simulation and the collision reconstruction steps of a traditional processing chain, speeding up significantly the events generation workflow. Through model optimization and hyperparameter tuning, we achieve state-of-the-art precision on the jet four-momentum, while providing an accurate description of the constituents momenta, and an inference time comparable to that of a rule-based fast simulation.
    Towards a Common Speech Analysis Engine. (arXiv:2203.00613v1 [cs.CL])
    Recent innovations in self-supervised representation learning have led to remarkable advances in natural language processing. That said, in the speech processing domain, self-supervised representation learning-based systems are not yet considered state-of-the-art. We propose leveraging recent advances in self-supervised-based speech processing to create a common speech analysis engine. Such an engine should be able to handle multiple speech processing tasks, using a single architecture, to obtain state-of-the-art accuracy. The engine must also enable support for new tasks with small training datasets. Beyond that, a common engine should be capable of supporting distributed training with client in-house private data. We present the architecture for a common speech analysis engine based on the HuBERT self-supervised speech representation. Based on experiments, we report our results for language identification and emotion recognition on the standard evaluations NIST-LRE 07 and IEMOCAP. Our results surpass the state-of-the-art performance reported so far on these tasks. We also analyzed our engine on the emotion recognition task using reduced amounts of training data and show how to achieve improved results.
    Generative Adversarial Networks. (arXiv:2203.00667v1 [cs.CV])
    Generative Adversarial Networks (GANs) are very popular frameworks for generating high-quality data, and are immensely used in both the academia and industry in many domains. Arguably, their most substantial impact has been in the area of computer vision, where they achieve state-of-the-art image generation. This chapter gives an introduction to GANs, by discussing their principle mechanism and presenting some of their inherent problems during training and evaluation. We focus on these three issues: (1) mode collapse, (2) vanishing gradients, and (3) generation of low-quality images. We then list some architecture-variant and loss-variant GANs that remedy the above challenges. Lastly, we present two utilization examples of GANs for real-world applications: Data augmentation and face images generation.
    How certain are your uncertainties?. (arXiv:2203.00238v1 [cs.LG])
    Having a measure of uncertainty in the output of a deep learning method is useful in several ways, such as in assisting with interpretation of the outputs, helping build confidence with end users, and for improving the training and performance of the networks. Therefore, several different methods have been proposed to capture various types of uncertainty, including epistemic (relating to the model used) and aleatoric (relating to the data) sources, with the most commonly used methods for estimating these being test-time dropout for epistemic uncertainty and test-time augmentation for aleatoric uncertainty. However, these methods are parameterised (e.g. amount of dropout or type and level of augmentation) and so there is a whole range of possible uncertainties that could be calculated, even with a fixed network and dataset. This work investigates the stability of these uncertainty measurements, in terms of both magnitude and spatial pattern. In experiments using the well characterised BraTS challenge, we demonstrate substantial variability in the magnitude and spatial pattern of these uncertainties, and discuss the implications for interpretability, repeatability and confidence in results.
    A comparative study of several parameterizations for speaker recognition. (arXiv:2203.00513v1 [cs.SD])
    This paper presents an exhaustive study about the robustness of several parameterizations, in speaker verification and identification tasks. We have studied several mismatch conditions: different recording sessions, microphones, and different languages (it has been obtained from a bilingual set of speakers). This study reveals that the combination of several parameterizations can improve the robustness in all the scenarios for both tasks, identification and verification. In addition, two different methods have been evaluated: vector quantization, and covariance matrices with an arithmetic-harmonic sphericity measure.
    LISA: Learning Interpretable Skill Abstractions from Language. (arXiv:2203.00054v1 [cs.LG])
    Learning policies that effectually utilize language instructions in complex, multi-task environments is an important problem in imitation learning. While it is possible to condition on the entire language instruction directly, such an approach could suffer from generalization issues. To encode complex instructions into skills that can generalize to unseen instructions, we propose Learning Interpretable Skill Abstractions (LISA), a hierarchical imitation learning framework that can learn diverse, interpretable skills from language-conditioned demonstrations. LISA uses vector quantization to learn discrete skill codes that are highly correlated with language instructions and the behavior of the learned policy. In navigation and robotic manipulation environments, LISA is able to outperform a strong non-hierarchical baseline in the low data regime and compose learned skills to solve tasks containing unseen long-range instructions. Our method demonstrates a more natural way to condition on language in sequential decision-making problems and achieve interpretable and controllable behavior with the learned skills.
    E-LMC: Extended Linear Model of Coregionalization for Predictions of Spatial Fields. (arXiv:2203.00525v1 [cs.LG])
    Physical simulations based on partial differential equations typically generate spatial fields results, which are utilized to calculate specific properties of a system for engineering design and optimization. Due to the intensive computational burden of the simulations, a surrogate model mapping the low-dimensional inputs to the spatial fields are commonly built based on a relatively small dataset. To resolve the challenge of predicting the whole spatial field, the popular linear model of coregionalization (LMC) can disentangle complicated correlations within the high-dimensional spatial field outputs and deliver accurate predictions. However, LMC fails if the spatial field cannot be well approximated by a linear combination of base functions with latent processes. In this paper, we extend LMC by introducing an invertible neural network to linearize the highly complex and nonlinear spatial fields such that the LMC can easily generalize to nonlinear problems while preserving the traceability and scalability. Several real-world applications demonstrate that E-LMC can exploit spatial correlations effectively, showing a maximum improvement of about 40% over the original LMC and outperforming the other state-of-the-art spatial field models.
    Mental State Classification Using Multi-graph Features. (arXiv:2203.00516v1 [eess.SP])
    We consider the problem of extracting features from passive, multi-channel electroencephalogram (EEG) devices for downstream inference tasks related to high-level mental states such as stress and cognitive load. Our proposed method leverages recently developed multi-graph tools and applies them to the time series of graphs implied by the statistical dependence structure (e.g., correlation) amongst the multiple sensors. We compare the effectiveness of the proposed features to traditional band power-based features in the context of three classification experiments and find that the two feature sets offer complementary predictive information. We conclude by showing that the importance of particular channels and pairs of channels for classification when using the proposed features is neuroscientifically valid.
    Divergence Regulated Encoder Network for Joint Dimensionality Reduction and Classification. (arXiv:2012.15764v5 [cs.LG] UPDATED)
    Feature representation is an important aspect of remote-sensing based image classification. While deep convolutional neural networks are able to effectively amalgamate information, large numbers of parameters often make learned features inscrutable and difficult to transfer to alternative models. In order to better represent statistical texture information for remote-sensing image classification, in this paper, we investigate performing joint dimensionality reduction and classification using a novel histogram neural network. Motivated by a popular dimensionality reduction approach, t-Distributed Stochastic Neighbor Embedding (t-SNE), our proposed method incorporates a classification loss computed on samples in a low-dimensional embedding space. We compare the learned sample embeddings against coordinates found by t-SNE in terms of classification accuracy and qualitative assessment. We also explore use of various divergence measures in the t-SNE objective. The proposed method has several advantages such as readily embedding out-of-sample points and reducing feature dimensionality while retaining class discriminability. Our results show that the proposed approach maintains and/or improves classification performance and reveals characteristics of features produced by neural networks that may be helpful for other applications.
    Differentiating and Integrating ZX Diagrams. (arXiv:2201.13250v2 [quant-ph] UPDATED)
    ZX-calculus has proved to be a useful tool for quantum technology with a wide range of successful applications. Most of these applications are of an algebraic nature. However, other tasks that involve differentiation and integration remain unreachable with current ZX techniques. Here we elevate ZX to an analytical perspective by realising differentiation and integration entirely within the framework of ZX-calculus. We explicitly illustrate the new analytic framework of ZX-calculus by applying it in context of quantum machine learning.
    A predictive analytics approach for stroke prediction using machine learning and neural networks. (arXiv:2203.00497v1 [cs.LG])
    The negative impact of stroke in society has led to concerted efforts to improve the management and diagnosis of stroke. With an increased synergy between technology and medical diagnosis, caregivers create opportunities for better patient management by systematically mining and archiving the patients' medical records. Therefore, it is vital to study the interdependency of these risk factors in patients' health records and understand their relative contribution to stroke prediction. This paper systematically analyzes the various factors in electronic health records for effective stroke prediction. Using various statistical techniques and principal component analysis, we identify the most important factors for stroke prediction. We conclude that age, heart disease, average glucose level, and hypertension are the most important factors for detecting stroke in patients. Furthermore, a perceptron neural network using these four attributes provides the highest accuracy rate and lowest miss rate compared to using all available input features and other benchmarking algorithms. As the dataset is highly imbalanced concerning the occurrence of stroke, we report our results on a balanced dataset created via sub-sampling techniques.
    Molecular Dynamics of Polymer-lipids in Solution from Supervised Machine Learning. (arXiv:2203.00151v1 [cond-mat.stat-mech])
    Machine learning techniques including neural networks are popular tools for materials and chemical scientists with applications that may provide viable alternative methods in the analysis of structure and energetics of systems ranging from crystals to biomolecules. However, efforts are less abundant for prediction of dynamics. Here we explore the ability of three well established recurrent neural network architectures for forecasting the energetics of a macromolecular polymer-lipid aggregate solvated in ethyl acetate at ambient conditions. Data models generated from recurrent neural networks are trained and tested on nanoseconds-long time series of the intra-macromolecules potential energy and their interaction energy with the solvent generated from Molecular Dynamics and containing half million points. Our exhaustive analyses convey that the three recurrent neural network investigated generate data models with limited capability of reproducing the energetic fluctuations and yielding short or long term energetics forecasts with underlying distribution of points inconsistent with the input series distributions. We propose an in silico experimental protocol consisting on forming an ensemble of artificial network models trained on an ensemble of series with additional features from time series containing pre-clustered time patterns of the original series. The forecast process improves by predicting a band of forecasted time series with a spread of values consistent with the molecular dynamics energy fluctuations span. However, the distribution of points from the band of forecasts is not optimal. Although the three inspected recurrent neural networks were unable of generating single models that reproduce the actual fluctuations of the inspected molecular system energies in thermal equilibrium at the nanosecond scale, the proposed protocol provides useful estimates of the molecular fate
    On Testability of the Front-Door Model via Verma Constraints. (arXiv:2203.00161v1 [stat.ME])
    The front-door criterion can be used to identify and compute causal effects despite the existence of unmeasured confounders between a treatment and outcome. However, the key assumptions -- (i) the existence of a variable (or set of variables) that fully mediates the effect of the treatment on the outcome, and (ii) which simultaneously does not suffer from similar issues of confounding as the treatment-outcome pair -- are often deemed implausible. This paper explores the testability of these assumptions. We show that under mild conditions involving an auxiliary variable, the assumptions encoded in the front-door model (and simple extensions of it) may be tested via generalized equality constraints a.k.a Verma constraints. We propose two goodness-of-fit tests based on this observation, and evaluate the efficacy of our proposal on real and synthetic data. We also provide theoretical and empirical comparisons to instrumental variable approaches to handling unmeasured confounding.
    FedREP: Towards Horizontal Federated Load Forecasting for Retail Energy Providers. (arXiv:2203.00219v1 [cs.DC])
    As Smart Meters are collecting and transmitting household energy consumption data to Retail Energy Providers (REP), the main challenge is to ensure the effective use of fine-grained consumer data while ensuring data privacy. In this manuscript, we tackle this challenge for energy load consumption forecasting in regards to REPs which is essential to energy demand management, load switching and infrastructure development. Specifically, we note that existing energy load forecasting is centralized, which are not scalable and most importantly, vulnerable to data privacy threats. Besides, REPs are individual market participants and liable to ensure the privacy of their own customers. To address this issue, we propose a novel horizontal privacy-preserving federated learning framework for REPs energy load forecasting, namely FedREP. We consider a federated learning system consisting of a control centre and multiple retailers by enabling multiple REPs to build a common, robust machine learning model without sharing data, thus addressing critical issues such as data privacy, data security and scalability. For forecasting, we use a state-of-the-art Long Short-Term Memory (LSTM) neural network due to its ability to learn long term sequences of observations and promises of higher accuracy with time-series data while solving the vanishing gradient problem. Finally, we conduct extensive data-driven experiments using a real energy consumption dataset. Experimental results demonstrate that our proposed federated learning framework can achieve sufficient performance in terms of MSE ranging between 0.3 to 0.4 and is relatively similar to that of a centralized approach while preserving privacy and improving scalability.
    AI-based approach for improving the detection of blood doping in sports. (arXiv:2203.00001v1 [cs.LG])
    Sports officials around the world are facing incredible challenges due to the unfair means of practices performed by the athletes to improve their performance in the game. It includes the intake of hormonal based drugs or transfusion of blood to increase their strength and the result of their training. However, the current direct test of detection of these cases includes the laboratory-based method, which is limited because of the cost factors, availability of medical experts, etc. This leads us to seek for indirect tests. With the growing interest of Artificial Intelligence in healthcare, it is important to propose an algorithm based on blood parameters to improve decision making. In this paper, we proposed a statistical and machine learning-based approach to identify the presence of doping substance rhEPO in blood samples.
    Investigating the Spatiotemporal Charging Demand and Travel Behavior of Electric Vehicles Using GPS Data: A Machine Learning Approach. (arXiv:2203.00135v1 [eess.SY])
    The increasing market penetration of electric vehicles (EVs) may change the travel behavior of drivers and pose a significant electricity demand on the power system. Since the electricity demand depends on the travel behavior of EVs, which are inherently uncertain, the forecasting of daily charging demand (CD) will be a challenging task. In this paper, we use the recorded GPS data of EVs and conventional gasoline-powered vehicles from the same city to investigate the potential shift in the travel behavior of drivers from conventional vehicles to EVs and forecast the spatiotemporal patterns of daily CD. Our analysis reveals that the travel behavior of EVs and conventional vehicles are similar. Also, the forecasting results indicate that the developed models can generate accurate spatiotemporal patterns of the daily CD.
    Data-efficient learning of object-centric grasp preferences. (arXiv:2203.00384v1 [cs.RO])
    Grasping made impressive progress during the last few years thanks to deep learning. However, there are many objects for which it is not possible to choose a grasp by only looking at an RGB-D image, might it be for physical reasons (e.g., a hammer with uneven mass distribution) or task constraints (e.g., food that should not be spoiled). In such situations, the preferences of experts need to be taken into account. In this paper, we introduce a data-efficient grasping pipeline (Latent Space GP Selector -- LGPS) that learns grasp preferences with only a few labels per object (typically 1 to 4) and generalizes to new views of this object. Our pipeline is based on learning a latent space of grasps with a dataset generated with any state-of-the-art grasp generator (e.g., Dex-Net). This latent space is then used as a low-dimensional input for a Gaussian process classifier that selects the preferred grasp among those proposed by the generator. The results show that our method outperforms both GR-ConvNet and GG-CNN (two state-of-the-art methods that are also based on labeled grasps) on the Cornell dataset, especially when only a few labels are used: only 80 labels are enough to correctly choose 80% of the grasps (885 scenes, 244 objects). Results are similar on our dataset (91 scenes, 28 objects).
    Parameter estimation for WMTI-Watson model of white matter using encoder-decoder recurrent neural network. (arXiv:2203.00595v1 [physics.med-ph])
    Biophysical modelling of the diffusion MRI signal provides estimates of specific microstructural tissue properties.Although nonlinear optimization such as non-linear least squares (NLLS) is the most widespread method for model estimation, it suffers from local minima and high computational cost. Deep Learning approaches are steadily replacing NL fitting, but come with the limitation that the model needs to be retrained for each acquisition protocol and noise level. The White Matter Tract Integrity (WMTI)-Watson model was proposed as an implementation of the Standard Model of diffusion in white matter that estimates model parameters from the diffusion and kurtosis tensors (DKI). Here we proposed a deep learning approach based on the encoder-decoder recurrent neural network (RNN) to increase the robustness and accelerate the parameter estimation of WMTI-Watson. We use an embedding approach to render the model insensitive to potential differences in distributions between training data and experimental data. This RNN-based solver thus has the advantage of being highly efficient in computation and more readily translatable to other datasets, irrespective of acquisition protocol and underlying parameter distributions as long as DKI was pre-computed from the data. In this study, we evaluated the performance of NLLS, the RNN-based method and a multilayer perceptron (MLP) on synthetic and in vivo datasets of rat and human brain. We showed that the proposed RNN-based fitting approach had the advantage of highly reduced computation time over NLLS (from hours to seconds), with similar accuracy and precision but improved robustness, and superior translatability to new datasets over MLP.
    Robust Multi-Agent Bandits Over Undirected Graphs. (arXiv:2203.00076v1 [cs.LG])
    We consider a multi-agent multi-armed bandit setting in which $n$ honest agents collaborate over a network to minimize regret but $m$ malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur $O( (m + K/n) \log (T) / \Delta )$ regret in this setting, where $K$ is the number of arms and $\Delta$ is the arm gap. For $m \ll K$, this improves over the single-agent baseline regret of $O(K\log(T)/\Delta)$. In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in $K$ and $n$. In light of this negative result, we propose a new algorithm for which the $i$-th agent has regret $O( ( d_{\text{mal}}(i) + K/n) \log(T)/\Delta)$ on any connected and undirected graph, where $d_{\text{mal}}(i)$ is the number of $i$'s neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where $d_{\text{mal}}(i) = m$), and show the effect of malicious agents is entirely local (in the sense that only the $d_{\text{mal}}(i)$ malicious agents directly connected to $i$ affect its long-term regret).
    Neural Score Matching for High-Dimensional Causal Inference. (arXiv:2203.00554v1 [stat.ML])
    Traditional methods for matching in causal inference are impractical for high-dimensional datasets. They suffer from the curse of dimensionality: exact matching and coarsened exact matching find exponentially fewer matches as the input dimension grows, and propensity score matching may match highly unrelated units together. To overcome this problem, we develop theoretical results which motivate the use of neural networks to obtain non-trivial, multivariate balancing scores of a chosen level of coarseness, in contrast to the classical, scalar propensity score. We leverage these balancing scores to perform matching for high-dimensional causal inference and call this procedure neural score matching. We show that our method is competitive against other matching approaches on semi-synthetic high-dimensional datasets, both in terms of treatment effect estimation and reducing imbalance.
    Local and Global GANs with Semantic-Aware Upsampling for Image Generation. (arXiv:2203.00047v1 [cs.CV])
    In this paper, we address the task of semantic-guided image generation. One challenge common to most existing image-level generation methods is the difficulty in generating small objects and detailed local textures. To address this, in this work we consider generating images using local context. As such, we design a local class-specific generative network using semantic maps as guidance, which separately constructs and learns subgenerators for different classes, enabling it to capture finer details. To learn more discriminative class-specific feature representations for the local generation, we also propose a novel classification module. To combine the advantages of both global image-level and local class-specific generation, a joint generation network is designed with an attention fusion module and a dual-discriminator structure embedded. Lastly, we propose a novel semantic-aware upsampling method, which has a larger receptive field and can take far-away pixels that are semantically related for feature upsampling, enabling it to better preserve semantic consistency for instances with the same semantic labels. Extensive experiments on two image generation tasks show the superior performance of the proposed method. State-of-the-art results are established by large margins on both tasks and on nine challenging public benchmarks. The source code and trained models are available at https://github.com/Ha0Tang/LGGAN.
    The Concordance Index decomposition -- A measure for a deeper understanding of survival prediction models. (arXiv:2203.00144v1 [cs.LG])
    The Concordance Index (C-index) is a commonly used metric in Survival Analysis to evaluate how good a prediction model is. This paper proposes a decomposition of the C-Index into a weighted harmonic mean of two quantities: one for ranking observed events versus other observed events, and the other for ranking observed events versus censored cases. This decomposition allows a more fine-grained analysis of the pros and cons of survival prediction methods. The utility of the decomposition is demonstrated using three benchmark survival analysis models (Cox Proportional Hazard, Random Survival Forest, and Deep Adversarial Time-to-Event Network) together with a new variational generative neural-network-based method (SurVED), which is also proposed in this paper. The demonstration is done on four publicly available datasets with varying censoring levels. The analysis with the C-index decomposition shows that all methods essentially perform equally well when the censoring level is high because of the dominance of the term measuring the ranking of events versus censored cases. In contrast, some methods deteriorate when the censoring level decreases because they do not rank the events versus other events well.
    Addressing Randomness in Evaluation Protocols for Out-of-Distribution Detection. (arXiv:2203.00382v1 [cs.LG])
    Deep Neural Networks for classification behave unpredictably when confronted with inputs not stemming from the training distribution. This motivates out-of-distribution detection (OOD) mechanisms. The usual lack of prior information on out-of-distribution data renders the performance estimation of detection approaches on unseen data difficult. Several contemporary evaluation protocols are based on open set simulations, which average the performance over up to five synthetic random splits of a dataset into in- and out-of-distribution samples. However, the number of possible splits may be much larger, and the performance of Deep Neural Networks is known to fluctuate significantly depending on different sources of random variation. We empirically demonstrate that current protocols may fail to provide reliable estimates of the expected performance of OOD methods. By casting this evaluation as a random process, we generalize the concept of open set simulations and propose to estimate the performance of OOD methods using a Monte Carlo approach that addresses the randomness.
    Effectiveness of Delivered Information Trade Study. (arXiv:2203.00116v1 [cs.CV])
    The sensor to shooter timeline is affected by two main variables: satellite positioning and asset positioning. Speeding up satellite positioning by adding more sensors or by decreasing processing time is important only if there is a prepared shooter, otherwise the main source of time is getting the shooter into position. However, the intelligence community should work towards the exploitation of sensors to the highest speed and effectiveness possible. Achieving a high effectiveness while keeping speed high is a tradeoff that must be considered in the sensor to shooter timeline. In this paper we investigate two main ideas, increasing the effectiveness of satellite imagery through image manipulation and how on-board image manipulation would affect the sensor to shooter timeline. We cover these ideas in four scenarios: Discrete Event Simulation of onboard processing versus ground station processing, quality of information with cloud cover removal, information improvement with super resolution, and data reduction with image to caption. This paper will show how image manipulation techniques such as Super Resolution, Cloud Removal, and Image to Caption will improve the quality of delivered information in addition to showing how those processes effect the sensor to shooter timeline.
    GROW: A Row-Stationary Sparse-Dense GEMM Accelerator for Memory-Efficient Graph Convolutional Neural Networks. (arXiv:2203.00158v1 [cs.AR])
    Graph convolutional neural networks (GCNs) have emerged as a key technology in various application domains where the input data is relational. A unique property of GCNs is that its two primary execution stages, aggregation and combination, exhibit drastically different dataflows. Consequently, prior GCN accelerators tackle this research space by casting the aggregation and combination stages as a series of sparse-dense matrix multiplication. However, prior work frequently suffers from inefficient data movements, leaving significant performance left on the table. We present GROW, a GCN accelerator based on Gustavson's algorithm to architect a row-wise product based sparse-dense GEMM accelerator. GROW co-designs the software/hardware that strikes a balance in locality and parallelism for GCNs, achieving significant energy-efficiency improvements vs. state-of-the-art GCN accelerators.
    On the sample complexity of stabilizing linear dynamical systems from data. (arXiv:2203.00474v1 [math.OC])
    Learning controllers from data for stabilizing dynamical systems typically follows a two step process of first identifying a model and then constructing a controller based on the identified model. However, learning models means identifying generic descriptions of the dynamics of systems, which can require large amounts of data and extracting information that are unnecessary for the specific task of stabilization. The contribution of this work is to show that if a linear dynamical system has dimension (McMillan degree) $n$, then there always exist $n$ states from which a stabilizing feedback controller can be constructed, independent of the dimension of the representation of the observed states and the number of inputs. By building on previous work, this finding implies that any linear dynamical system can be stabilized from fewer observed states than the minimal number of states required for learning a model of the dynamics. The theoretical findings are demonstrated with numerical experiments that show the stabilization of the flow behind a cylinder from less data than necessary for learning a model.
    Equivariant and Stable Positional Encoding for More Powerful Graph Neural Networks. (arXiv:2203.00199v1 [cs.LG])
    Graph neural networks (GNN) have shown great advantages in many graph-based learning tasks but often fail to predict accurately for a task-based on sets of nodes such as link/motif prediction and so on. Many works have recently proposed to address this problem by using random node features or node distance features. However, they suffer from either slow convergence, inaccurate prediction, or high complexity. In this work, we revisit GNNs that allow using positional features of nodes given by positional encoding (PE) techniques such as Laplacian Eigenmap, Deepwalk, etc. GNNs with PE often get criticized because they are not generalizable to unseen graphs (inductive) or stable. Here, we study these issues in a principled way and propose a provable solution, a class of GNN layers termed PEG with rigorous mathematical analysis. PEG uses separate channels to update the original node features and positional features. PEG imposes permutation equivariance w.r.t. the original node features and rotation equivariance w.r.t. the positional features simultaneously. Extensive link prediction experiments over 8 real-world networks demonstrate the advantages of PEG in generalization and scalability.
    Improving Time Sensitivity for Question Answering over Temporal Knowledge Graphs. (arXiv:2203.00255v1 [cs.CL])
    Question answering over temporal knowledge graphs (KGs) efficiently uses facts contained in a temporal KG, which records entity relations and when they occur in time, to answer natural language questions (e.g., "Who was the president of the US before Obama?"). These questions often involve three time-related challenges that previous work fail to adequately address: 1) questions often do not specify exact timestamps of interest (e.g., "Obama" instead of 2000); 2) subtle lexical differences in time relations (e.g., "before" vs "after"); 3) off-the-shelf temporal KG embeddings that previous work builds on ignore the temporal order of timestamps, which is crucial for answering temporal-order related questions. In this paper, we propose a time-sensitive question answering (TSQA) framework to tackle these problems. TSQA features a timestamp estimation module to infer the unwritten timestamp from the question. We also employ a time-sensitive KG encoder to inject ordering information into the temporal KG embeddings that TSQA is based on. With the help of techniques to reduce the search space for potential answers, TSQA significantly outperforms the previous state of the art on a new benchmark for question answering over temporal KGs, especially achieving a 32% (absolute) error reduction on complex questions that require multiple steps of reasoning over facts in the temporal KG.
    DAMO-NLP at SemEval-2022 Task 11: A Knowledge-based System for Multilingual Named Entity Recognition. (arXiv:2203.00545v1 [cs.CL])
    The MultiCoNER shared task aims at detecting semantically ambiguous and complex named entities in short and low-context settings for multiple languages. The lack of contexts makes the recognition of ambiguous named entities challenging. To alleviate this issue, our team DAMO-NLP proposes a knowledge-based system, where we build a multilingual knowledge base based on Wikipedia to provide related context information to the named entity recognition (NER) model. Given an input sentence, our system effectively retrieves related contexts from the knowledge base. The original input sentences are then augmented with such context information, allowing significantly better contextualized token representations to be captured. Our system wins 10 out of 13 tracks in the MultiCoNER shared task.
    Combining Modular Skills in Multitask Learning. (arXiv:2202.13914v2 [cs.LG] UPDATED)
    A modular design encourages neural models to disentangle and recombine different facets of knowledge to generalise more systematically to new tasks. In this work, we assume that each task is associated with a subset of latent discrete skills from a (potentially small) inventory. In turn, skills correspond to parameter-efficient (sparse / low-rank) model parameterisations. By jointly learning these and a task-skill allocation matrix, the network for each task is instantiated as the average of the parameters of active skills. To favour non-trivial soft partitions of skills across tasks, we experiment with a series of inductive biases, such as an Indian Buffet Process prior and a two-speed learning rate. We evaluate our latent-skill model on two main settings: 1) multitask reinforcement learning for grounded instruction following on 8 levels of the BabyAI platform; and 2) few-shot adaptation of pre-trained text-to-text generative models on CrossFit, a benchmark comprising 160 NLP tasks. We find that the modular design of a network significantly increases sample efficiency in reinforcement learning and few-shot generalisation in supervised learning, compared to baselines with fully shared, task-specific, or conditionally generated parameters where knowledge is entangled across tasks. In addition, we show how discrete skills help interpretability, as they yield an explicit hierarchy of tasks.
    SUTD-PRCM Dataset and Neural Architecture Search Approach for Complex Metasurface Design. (arXiv:2203.00002v1 [cs.LG])
    Metasurfaces have received a lot of attentions recently due to their versatile capability in manipulating electromagnetic wave. Advanced designs to satisfy multiple objectives with non-linear constraints have motivated researchers in using machine learning (ML) techniques like deep learning (DL) for accelerated design of metasurfaces. For metasurfaces, it is difficult to make quantitative comparisons between different ML models without having a common and yet complex dataset used in many disciplines like image classification. Many studies were directed to a relatively constrained datasets that are limited to specified patterns or shapes in metasurfaces. In this paper, we present our SUTD polarized reflection of complex metasurfaces (SUTD-PRCM) dataset, which contains approximately 260,000 samples of complex metasurfaces created from electromagnetic simulation, and it has been used to benchmark our DL models. The metasurface patterns are divided into different classes to facilitate different degree of complexity, which involves identifying and exploiting the relationship between the patterns and the electromagnetic responses that can be compared in using different DL models. With the release of this SUTD-PRCM dataset, we hope that it will be useful for benchmarking existing or future DL models developed in the ML community. We also propose a classification problem that is less encountered and apply neural architecture search to have a preliminary understanding of potential modification to the neural architecture that will improve the prediction by DL models. Our finding shows that convolution stacking is not the dominant element of the neural architecture anymore, which implies that low-level features are preferred over the traditional deep hierarchical high-level features thus explains why deep convolutional neural network based models are not performing well in our dataset.
    A Framework for Real-World Multi-Robot Systems Running Decentralized GNN-Based Policies. (arXiv:2111.01777v2 [cs.RO] UPDATED)
    GNNs are a paradigm-shifting neural architecture to facilitate the learning of complex multi-agent behaviors. Recent work has demonstrated remarkable performance in tasks such as flocking, multi-agent path planning and cooperative coverage. However, the policies derived through GNN-based learning schemes have not yet been deployed to the real-world on physical multi-robot systems. In this work, we present the design of a system that allows for fully decentralized execution of GNN-based policies. We create a framework based on ROS2 and elaborate its details in this paper. We demonstrate our framework on a case-study that requires tight coordination between robots, and present first-of-a-kind results that show successful real-world deployment of GNN-based policies on a decentralized multi-robot system relying on Adhoc communication. A video demonstration of this case-study, as well as the accompanying source code repository, can be found online. https://www.youtube.com/watch?v=COh-WLn4iO4 https://github.com/proroklab/ros2_multi_agent_passage https://github.com/proroklab/rl_multi_agent_passage
    Explainability for identification of vulnerable groups in machine learning models. (arXiv:2203.00317v1 [cs.LG])
    If a prediction model identifies vulnerable individuals or groups, the use of that model may become an ethical issue. But can we know that this is what a model does? Machine learning fairness as a field is focused on the just treatment of individuals and groups under information processing with machine learning methods. While considerable attention has been given to mitigating discrimination of protected groups, vulnerable groups have not received the same attention. Unlike protected groups, which can be regarded as always vulnerable, a vulnerable group may be vulnerable in one context but not in another. This raises new challenges on how and when to protect vulnerable individuals and groups under machine learning. Methods from explainable artificial intelligence (XAI), in contrast, do consider more contextual issues and are concerned with answering the question "why was this decision made?". Neither existing fairness nor existing explainability methods allow us to ascertain if a prediction model identifies vulnerability. We discuss this problem and propose approaches for analysing prediction models in this respect.
    Global-Local Regularization Via Distributional Robustness. (arXiv:2203.00553v1 [cs.LG])
    Despite superior performance in many situations, deep neural networks are often vulnerable to adversarial examples and distribution shifts, limiting model generalization ability in real-world applications. To alleviate these problems, recent approaches leverage distributional robustness optimization (DRO) to find the most challenging distribution, and then minimize loss function over this most challenging distribution. Regardless of achieving some improvements, these DRO approaches have some obvious limitations. First, they purely focus on local regularization to strengthen model robustness, missing a global regularization effect which is useful in many real-world applications (e.g., domain adaptation, domain generalization, and adversarial machine learning). Second, the loss functions in the existing DRO approaches operate in only the most challenging distribution, hence decouple with the original distribution, leading to a restrictive modeling capability. In this paper, we propose a novel regularization technique, following the veins of Wasserstein-based DRO framework. Specifically, we define a particular joint distribution and Wasserstein-based uncertainty, allowing us to couple the original and most challenging distributions for enhancing modeling capability and applying both local and global regularizations. Empirical studies on different learning problems demonstrate that our proposed approach significantly outperforms the existing regularization approaches in various domains: semi-supervised learning, domain adaptation, domain generalization, and adversarial machine learning.
    On Testability and Goodness of Fit Tests in Missing Data Models. (arXiv:2203.00132v1 [stat.ME])
    Significant progress has been made in developing identification and estimation techniques for missing data problems where modeling assumptions can be described via a directed acyclic graph. The validity of results using such techniques rely on the assumptions encoded by the graph holding true; however, verification of these assumptions has not received sufficient attention in prior work. In this paper, we provide new insights on the testable implications of three broad classes of missing data graphical models, and design goodness-of-fit tests around them. The classes of models explored are: sequential missing-at-random and missing-not-at-random models which can be used for modeling longitudinal studies with dropout/censoring, and a kind of no self-censoring model which can be applied to cross-sectional studies and surveys.
    Differentially private training of residual networks with scale normalisation. (arXiv:2203.00324v1 [cs.LG])
    We investigate the optimal choice of replacement layer for Batch Normalisation (BN) in residual networks (ResNets) for training with Differentially Private Stochastic Gradient Descent (DP-SGD) and study the phenomenon of scale mixing in residual blocks, whereby the activations on the two branches are scaled differently. Our experimental evaluation indicates that a hyperparameter search over 1-64 Group Normalisation (GN) groups improves the accuracy of ResNet-9 and ResNet-50 considerably in both benchmark (CIFAR-10) and large-image (ImageNette) tasks. Moreover, Scale Normalisation, a simple modification to the model architecture by which an additional normalisation layer is introduced after the residual block's addition operation further improves the utility of ResNets allowing us to achieve state-of-the-art results on CIFAR-10.
    Can Mean Field Control (MFC) Approximate Cooperative Multi Agent Reinforcement Learning (MARL) with Non-Uniform Interaction?. (arXiv:2203.00035v1 [cs.LG])
    Mean-Field Control (MFC) is a powerful tool to solve Multi-Agent Reinforcement Learning (MARL) problems. Recent studies have shown that MFC can well-approximate MARL when the population size is large and the agents are exchangeable. Unfortunately, the presumption of exchangeability implies that all agents uniformly interact with one another which is not true in many practical scenarios. In this article, we relax the assumption of exchangeability and model the interaction between agents via an arbitrary doubly stochastic matrix. As a result, in our framework, the mean-field `seen' by different agents are different. We prove that, if the reward of each agent is an affine function of the mean-field seen by that agent, then one can approximate such a non-uniform MARL problem via its associated MFC problem within an error of $e=\mathcal{O}(\frac{1}{\sqrt{N}}[\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}])$ where $N$ is the population size and $|\mathcal{X}|$, $|\mathcal{U}|$ are the sizes of state and action spaces respectively. Finally, we develop a Natural Policy Gradient (NPG) algorithm that can provide a solution to the non-uniform MARL with an error $\mathcal{O}(\max\{e,\epsilon\})$ and a sample complexity of $\mathcal{O}(\epsilon^{-3})$ for any $\epsilon >0$.
    Technological evaluation of two AFIS systems. (arXiv:2203.00447v1 [cs.CR])
    This paper provides a technological evaluation of two Automatic Fingerprint Identification Systems (AFIS) used in forensic applications. Both of them are installed and working in Spanish police premises. The first one is a Printrak AFIS 2000 system with a database of more than 450,000 fingerprints, while the second one is a NEC AFIS 21 SAID NT-LEXS Release 2.4.4 with a database of more than 15 million fingerprints. Our experiments reveal that although both systems can manage inkless fingerprints, the latest one offers better experimental results
    RPT: Toward Transferable Model on Heterogeneous Researcher Data via Pre-Training. (arXiv:2110.07336v2 [cs.IR] UPDATED)
    With the growth of the academic engines, the mining and analysis acquisition of massive researcher data, such as collaborator recommendation and researcher retrieval, has become indispensable. It can improve the quality of services and intelligence of academic engines. Most of the existing studies for researcher data mining focus on a single task for a particular application scenario and learning a task-specific model, which is usually unable to transfer to out-of-scope tasks. The pre-training technology provides a generalized and sharing model to capture valuable information from enormous unlabeled data. The model can accomplish multiple downstream tasks via a few fine-tuning steps. In this paper, we propose a multi-task self-supervised learning-based researcher data pre-training model named RPT. Specifically, we divide the researchers' data into semantic document sets and community graph. We design the hierarchical Transformer and the local community encoder to capture information from the two categories of data, respectively. Then, we propose three self-supervised learning objectives to train the whole model. Finally, we also propose two transfer modes of RPT for fine-tuning in different scenarios. We conduct extensive experiments to evaluate RPT, results on three downstream tasks verify the effectiveness of pre-training for researcher data mining.
    Distributional Reinforcement Learning for Scheduling of (Bio)chemical Production Processes. (arXiv:2203.00636v1 [eess.SY])
    Reinforcement Learning (RL) has recently received significant attention from the process systems engineering and control communities. Recent works have investigated the application of RL to identify optimal scheduling decision in the presence of uncertainty. In this work, we present a RL methodology to address precedence and disjunctive constraints as commonly imposed on production scheduling problems. This work naturally enables the optimization of risk-sensitive formulations such as the conditional value-at-risk (CVaR), which are essential in realistic scheduling processes. The proposed strategy is investigated thoroughly in a single-stage, parallel batch production environment, and benchmarked against mixed integer linear programming (MILP) strategies. We show that the policy identified by our approach is able to account for plant uncertainties in online decision-making, with expected performance comparable to existing MILP methods. Additionally, the framework gains the benefits of optimizing for risk-sensitive measures, and identifies decisions orders of magnitude faster than the most efficient optimization approaches. This promises to mitigate practical issues and ease in handling realizations of process uncertainty in the paradigm of online production scheduling.
    Interpretable Molecular Graph Generation via Monotonic Constraints. (arXiv:2203.00412v1 [cs.LG])
    Designing molecules with specific properties is a long-lasting research problem and is central to advancing crucial domains such as drug discovery and material science. Recent advances in deep graph generative models treat molecule design as graph generation problems which provide new opportunities toward the breakthrough of this long-lasting problem. Existing models, however, have many shortcomings, including poor interpretability and controllability toward desired molecular properties. This paper focuses on new methodologies for molecule generation with interpretable and controllable deep generative models, by proposing new monotonically-regularized graph variational autoencoders. The proposed models learn to represent the molecules with latent variables and then learn the correspondence between them and molecule properties parameterized by polynomial functions. To further improve the intepretability and controllability of molecule generation towards desired properties, we derive new objectives which further enforce monotonicity of the relation between some latent variables and target molecule properties such as toxicity and clogP. Extensive experimental evaluation demonstrates the superiority of the proposed framework on accuracy, novelty, disentanglement, and control towards desired molecular properties. The code is open-source at https://anonymous.4open.science/r/MDVAE-FD2C.
    Path sampling of recurrent neural networks by incorporating known physics. (arXiv:2203.00597v1 [cond-mat.dis-nn])
    Recurrent neural networks have seen widespread use in modeling dynamical systems in varied domains such as weather prediction, text prediction and several others. Often one wishes to supplement the experimentally observed dynamics with prior knowledge or intuition about the system. While the recurrent nature of these networks allows them to model arbitrarily long memories in the time series used in training, it makes it harder to impose prior knowledge or intuition through generic constraints. In this work, we present a path sampling approach based on principle of Maximum Caliber that allows us to include generic thermodynamic or kinetic constraints into recurrent neural networks. We show the method here for a widely used type of recurrent neural network known as long short-term memory network in the context of supplementing time series collecting from all-atom molecular dynamics. We demonstrate the power of the formalism for different applications. Our method can be easily generalized to other generative artificial intelligence models and to generic time series in different areas of physical and social sciences, where one wishes to supplement limited data with intuition or theory based corrections.  ( 2 min )
    Private Frequency Estimation via Projective Geometry. (arXiv:2203.00194v1 [cs.CR])
    In this work, we propose a new algorithm ProjectiveGeometryResponse (PGR) for locally differentially private (LDP) frequency estimation. For a universe size of $k$ and with $n$ users, our $\varepsilon$-LDP algorithm has communication cost $\lceil\log_2k\rceil$ bits in the private coin setting and $\varepsilon\log_2 e + O(1)$ in the public coin setting, and has computation cost $O(n + k\exp(\varepsilon) \log k)$ for the server to approximately reconstruct the frequency histogram, while achieving the state-of-the-art privacy-utility tradeoff. In many parameter settings used in practice this is a significant improvement over the $ O(n+k^2)$ computation cost that is achieved by the recent PI-RAPPOR algorithm (Feldman and Talwar; 2021). Our empirical evaluation shows a speedup of over 50x over PI-RAPPOR while using approximately 75x less memory for practically relevant parameter settings. In addition, the running time of our algorithm is within an order of magnitude of HadamardResponse (Acharya, Sun, and Zhang; 2019) and RecursiveHadamardResponse (Chen, Kairouz, and Ozgur; 2020) which have significantly worse reconstruction error. The error of our algorithm essentially matches that of the communication- and time-inefficient but utility-optimal SubsetSelection (SS) algorithm (Ye and Barg; 2017). Our new algorithm is based on using Projective Planes over a finite field to define a small collection of sets that are close to being pairwise independent and a dynamic programming algorithm for approximate histogram reconstruction on the server side. We also give an extension of PGR, which we call HybridProjectiveGeometryResponse, that allows trading off computation time with utility smoothly.
    RAB: Provable Robustness Against Backdoor Attacks. (arXiv:2003.08904v6 [cs.LG] UPDATED)
    Recent studies have shown that deep neural networks are vulnerable to adversarial attacks, including evasion and backdoor (poisoning) attacks. On the defense side, there have been intensive efforts on improving both empirical and provable robustness against evasion attacks; however, provable robustness against backdoor attacks still remains largely unexplored. In this paper, we focus on certifying the machine learning model robustness against general threat models, especially backdoor attacks. We first provide a unified framework via randomized smoothing techniques and show how it can be instantiated to certify the robustness against both evasion and backdoor attacks. We then propose the first robust training process, RAB, to smooth the trained model and certify its robustness against backdoor attacks. We theoretically prove the robustness bound for machine learning models trained with RAB, and prove that our robustness bound is tight. We derive the robustness conditions for different smoothing distributions including Gaussian and uniform distributions. In addition, we theoretically show that it is possible to train the robust smoothed models efficiently for simple models such as K-nearest neighbor classifiers, and we propose an exact smooth-training algorithm which eliminates the need to sample from a noise distribution for such models. Empirically, we conduct comprehensive experiments for different machine learning models such as DNNs and K-NN models on MNIST, CIFAR-10, and ImageNette datasets and provide the first benchmark for certified robustness against backdoor attacks. In addition, we evaluate K-NN models on a spambase tabular dataset to demonstrate the advantages of the proposed exact algorithm. Both the theoretic analysis and the comprehensive evaluation on diverse ML models and datasets shed lights on further robust learning strategies against general training time attacks.  ( 3 min )
    A Probabilistic Deep Image Prior for Computational Tomography. (arXiv:2203.00479v1 [eess.IV])
    Existing deep-learning based tomographic image reconstruction methods do not provide accurate estimates of reconstruction uncertainty, hindering their real-world deployment. To address this limitation, we construct a Bayesian prior for tomographic reconstruction, which combines the classical total variation (TV) regulariser with the modern deep image prior (DIP). Specifically, we use a change of variables to connect our prior beliefs on the image TV semi-norm with the hyper-parameters of the DIP network. For the inference, we develop an approach based on the linearised Laplace method, which is scalable to high-dimensional settings. The resulting framework provides pixel-wise uncertainty estimates and a marginal likelihood objective for hyperparameter optimisation. We demonstrate the method on synthetic and real-measured high-resolution $\mu$CT data, and show that it provides superior calibration of uncertainty estimates relative to previous probabilistic formulations of the DIP.  ( 2 min )
    On classification of strategic agents who can both game and improve. (arXiv:2203.00124v1 [cs.GT])
    In this work, we consider classification of agents who can both game and improve. For example, people wishing to get a loan may be able to take some actions that increase their perceived credit-worthiness and others that also increase their true credit-worthiness. A decision-maker would like to define a classification rule with few false-positives (does not give out many bad loans) while yielding many true positives (giving out many good loans), which includes encouraging agents to improve to become true positives if possible. We consider two models for this problem, a general discrete model and a linear model, and prove algorithmic, learning, and hardness results for each. For the general discrete model, we give an efficient algorithm for the problem of maximizing the number of true positives subject to no false positives, and show how to extend this to a partial-information learning setting. We also show hardness for the problem of maximizing the number of true positives subject to a nonzero bound on the number of false positives, and that this hardness holds even for a finite-point version of our linear model. We also show that maximizing the number of true positives subject to no false positive is NP-hard in our full linear model. We additionally provide an algorithm that determines whether there exists a linear classifier that classifies all agents accurately and causes all improvable agents to become qualified, and give additional results for low-dimensional data.  ( 2 min )
    GCN-Transformer for short-term passenger flow prediction on holidays in urban rail transit systems. (arXiv:2203.00007v1 [cs.LG])
    The short-term passenger flow prediction of the urban rail transit system is of great significance for traffic operation and management. The emerging deep learning-based models provide effective methods to improve prediction accuracy. However, most of the existing models mainly predict the passenger flow on general weekdays, while few studies focus on predicting the holiday passenger flow, which can provide more significant information for operators because congestions or accidents generally occur on holidays. To this end, we propose a deep learning-based model named GCN-Transformer comprising graph conventional neural network (GCN) and Transformer for short-term passenger flow prediction on holidays. The GCN is applied to extract the spatial features of passenger flows and the Transformer is applied to extract the temporal features of passenger flows. Moreover, in addition to the historical passenger flow data, social media data are also incorporated into the prediction model, which has been proven to have a potential correlation with the fluctuation of passenger flow. The GCN-Transformer is tested on two large-scale real-world datasets from Nanning, China during the New Year holiday and is compared with several conventional prediction models. Results demonstrate its better robustness and advantages among baseline methods, which provides overwhelming support for practical applications of short-term passenger flow prediction on holidays
    Learning Neural Hamiltonian Dynamics: A Methodological Overview. (arXiv:2203.00128v1 [cs.LG])
    The past few years have witnessed an increased interest in learning Hamiltonian dynamics in deep learning frameworks. As an inductive bias based on physical laws, Hamiltonian dynamics endow neural networks with accurate long-term prediction, interpretability, and data-efficient learning. However, Hamiltonian dynamics also bring energy conservation or dissipation assumptions on the input data and additional computational overhead. In this paper, we systematically survey recently proposed Hamiltonian neural network models, with a special emphasis on methodologies. In general, we discuss the major contributions of these models, and compare them in four overlapping directions: 1) generalized Hamiltonian system; 2) symplectic integration, 3) generalized input form, and 4) extended problem settings. We also provide an outlook of the fundamental challenges and emerging opportunities in this area.
    Towards Practices for Human-Centered Machine Learning. (arXiv:2203.00432v1 [cs.LG])
    "Human-centered machine learning" (HCML) is a term that describes machine learning that applies to human-focused problems. Although this idea is noteworthy and generates scholarly excitement, scholars and practitioners have struggled to clearly define and implement HCML in computer science. This article proposes practices for human-centered machine learning, an area where studying and designing for social, cultural, and ethical implications are just as important as technical advances in ML. These practices bridge between interdisciplinary perspectives of HCI, AI, and sociotechnical fields, as well as ongoing discourse on this new area. The five practices include ensuring HCML is the appropriate solution space for a problem; conceptualizing problem statements as position statements; moving beyond interaction models to define the human; legitimizing domain contributions; and anticipating sociotechnical failure. I conclude by suggesting how these practices might be implemented in research and practice.
    Contrasting random and learned features in deep Bayesian linear regression. (arXiv:2203.00573v1 [cs.LG])
    Understanding how feature learning affects generalization is among the foremost goals of modern deep learning theory. Here, we study how the ability to learn representations affects the generalization performance of a simple class of models: deep Bayesian linear neural networks trained on unstructured Gaussian data. By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch. We show that both models display sample-wise double-descent behavior in the presence of label noise. Random feature models can also display model-wise double-descent if there are narrow bottleneck layers, while deep networks do not show these divergences. Random feature models can have particular widths that are optimal for generalization at a given data density, while making neural networks as wide or as narrow as possible is always optimal. Moreover, we show that the leading-order correction to the kernel-limit learning curve cannot distinguish between random feature models and deep networks in which all layers are trained. Taken together, our findings begin to elucidate how architectural details affect generalization performance in this simple class of deep regression models.
    Mixed Variational Inference. (arXiv:1901.04791v4 [stat.ML] UPDATED)
    The Laplace approximation has been one of the workhorses of Bayesian inference. It often delivers good approximations in practice despite the fact that it does not strictly take into account where the volume of posterior density lies. Variational approaches avoid this issue by explicitly minimising the Kullback-Leibler divergence DKL between a postulated posterior and the true (unnormalised) logarithmic posterior. However, they rely on a closed form DKL in order to update the variational parameters. To address this, stochastic versions of variational inference have been devised that approximate the intractable DKL with a Monte Carlo average. This approximation allows calculating gradients with respect to the variational parameters. However, variational methods often postulate a factorised Gaussian approximating posterior. In doing so, they sacrifice a-posteriori correlations. In this work, we propose a method that combines the Laplace approximation with the variational approach. The advantages are that we maintain: applicability on non-conjugate models, posterior correlations and a reduced number of free variational parameters. Numerical experiments demonstrate improvement over the Laplace approximation and variational inference with factorised Gaussian posteriors.  ( 2 min )
    Approximating a deep reinforcement learning docking agent using linear model trees. (arXiv:2203.00369v1 [cs.RO])
    Deep reinforcement learning has led to numerous notable results in robotics. However, deep neural networks (DNNs) are unintuitive, which makes it difficult to understand their predictions and strongly limits their potential for real-world applications due to economic, safety, and assurance reasons. To remedy this problem, a number of explainable AI methods have been presented, such as SHAP and LIME, but these can be either be too costly to be used in real-time robotic applications or provide only local explanations. In this paper, the main contribution is the use of a linear model tree (LMT) to approximate a DNN policy, originally trained via proximal policy optimization(PPO), for an autonomous surface vehicle with five control inputs performing a docking operation. The two main benefits of the proposed approach are: a) LMTs are transparent which makes it possible to associate directly the outputs (control actions, in our case) with specific values of the input features, b) LMTs are computationally efficient and can provide information in real-time. In our simulations, the opaque DNN policy controls the vehicle and the LMT runs in parallel to provide explanations in the form of feature attributions. Our results indicate that LMTs can be a useful component within digital assurance frameworks for autonomous ships.  ( 2 min )
    GraphWorld: Fake Graphs Bring Real Insights for GNNs. (arXiv:2203.00112v1 [cs.LG])
    Despite advances in the field of Graph Neural Networks (GNNs), only a small number (~5) of datasets are currently used to evaluate new models. This continued reliance on a handful of datasets provides minimal insight into the performance differences between models, and is especially challenging for industrial practitioners who are likely to have datasets which look very different from those used as academic benchmarks. In the course of our work on GNN infrastructure and open-source software at Google, we have sought to develop improved benchmarks that are robust, tunable, scalable,and generalizable. In this work we introduce GraphWorld, a novel methodology and system for benchmarking GNN models on an arbitrarily-large population of synthetic graphs for any conceivable GNN task. GraphWorld allows a user to efficiently generate a world with millions of statistically diverse datasets. It is accessible, scalable, and easy to use. GraphWorld can be run on a single machine without specialized hardware, or it can be easily scaled up to run on arbitrary clusters or cloud frameworks. Using GraphWorld, a user has fine-grained control over graph generator parameters, and can benchmark arbitrary GNN models with built-in hyperparameter tuning. We present insights from GraphWorld experiments regarding the performance characteristics of tens of thousands of GNN models over millions of benchmark datasets. We further show that GraphWorld efficiently explores regions of benchmark dataset space uncovered by standard benchmarks, revealing comparisons between models that have not been historically obtainable. Using GraphWorld, we also are able to study in-detail the relationship between graph properties and task performance metrics, which is nearly impossible with the classic collection of real-world benchmarks.  ( 2 min )
    Multi-task Learning Approach for Modulation and Wireless Signal Classification for 5G and Beyond: Edge Deployment via Model Compression. (arXiv:2203.00517v1 [eess.SP])
    Future communication networks must address the scarce spectrum to accommodate extensive growth of heterogeneous wireless devices. Wireless signal recognition is becoming increasingly more significant for spectrum monitoring, spectrum management, secure communications, among others. Consequently, comprehensive spectrum awareness on the edge has the potential to serve as a key enabler for the emerging beyond 5G networks. State-of-the-art studies in this domain have (i) only focused on a single task - modulation or signal (protocol) classification - which in many cases is insufficient information for a system to act on, (ii) consider either radar or communication waveforms (homogeneous waveform category), and (iii) does not address edge deployment during neural network design phase. In this work, for the first time in the wireless communication domain, we exploit the potential of deep neural networks based multi-task learning (MTL) framework to simultaneously learn modulation and signal classification tasks while considering heterogeneous wireless signals such as radar and communication waveforms in the electromagnetic spectrum. The proposed MTL architecture benefits from the mutual relation between the two tasks in improving the classification accuracy as well as the learning efficiency with a lightweight neural network model. We additionally include experimental evaluations of the model with over-the-air collected samples and demonstrate first-hand insight on model compression along with deep learning pipeline for deployment on resource-constrained edge devices. We demonstrate significant computational, memory, and accuracy improvement of the proposed model over two reference architectures. In addition to modeling a lightweight MTL model suitable for resource-constrained embedded radio platforms, we provide a comprehensive heterogeneous wireless signals dataset for public use.
    Gradient Importance Learning for Incomplete Observations. (arXiv:2107.01983v4 [cs.LG] UPDATED)
    Though recent works have developed methods that can generate estimates (or imputations) of the missing entries in a dataset to facilitate downstream analysis, most depend on assumptions that may not align with real-world applications and could suffer from poor performance in subsequent tasks such as classification. This is particularly true if the data have large missingness rates or a small sample size. More importantly, the imputation error could be propagated into the prediction step that follows, which may constrain the capabilities of the prediction model. In this work, we introduce the gradient importance learning (GIL) method to train multilayer perceptrons (MLPs) and long short-term memories (LSTMs) to directly perform inference from inputs containing missing values without imputation. Specifically, we employ reinforcement learning (RL) to adjust the gradients used to train these models via back-propagation. This allows the model to exploit the underlying information behind missingness patterns. We test the approach on real-world time-series (i.e., MIMIC-III), tabular data obtained from an eye clinic, and a standard dataset (i.e., MNIST), where our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
    Towards quantum advantage via topological data analysis. (arXiv:2005.02607v4 [quant-ph] UPDATED)
    Even after decades of quantum computing development, examples of generally useful quantum algorithms with exponential speedups over classical counterparts are scarce. Recent progress in quantum algorithms for linear-algebra positioned quantum machine learning (QML) as a potential source of such useful exponential improvements. Yet, in an unexpected development, a recent series of "dequantization" results has equally rapidly removed the promise of exponential speedups for several QML algorithms. This raises the critical question whether exponential speedups of other linear-algebraic QML algorithms persist. In this paper, we study the quantum-algorithmic methods behind the algorithm for topological data analysis of Lloyd, Garnerone and Zanardi through this lens. We provide evidence that the problem solved by this algorithm is classically intractable by showing that its natural generalization is as hard as simulating the one clean qubit model -- which is widely believed to require superpolynomial time on a classical computer -- and is thus very likely immune to dequantizations. Based on this result, we provide a number of new quantum algorithms for problems such as rank estimation and complex network analysis, along with complexity-theoretic evidence for their classical intractability. Furthermore, we analyze the suitability of the proposed quantum algorithms for near-term implementations. Our results provide a number of useful applications for full-blown, and restricted quantum computers with a guaranteed exponential speedup over classical methods, recovering some of the potential for linear-algebraic QML to become one of quantum computing's killer applications.
    TRILLsson: Distilled Universal Paralinguistic Speech Representations. (arXiv:2203.00236v1 [eess.AS])
    Recent advances in self-supervision have dramatically improved the quality of speech representations. However, deployment of state-of-the-art embedding models on devices has been restricted due to their limited public availability and large resource footprint. Our work addresses these issues by publicly releasing a collection of paralinguistic speech models that are small and near state-of-the-art performance. Our approach is based on knowledge distillation, and our models are distilled on public data only. We explore different architectures and thoroughly evaluate our models on the Non-Semantic Speech (NOSS) benchmark. Our largest distilled model is less than 15% the size of the original model (314MB vs 2.2GB), achieves over 96% the accuracy on 6 of 7 tasks, and is trained on 6.5% the data. The smallest model is 1% in size (22MB) and achieves over 90% the accuracy on 6 of 7 tasks. Our models outperform the open source Wav2Vec 2.0 model on 6 of 7 tasks, and our smallest model outperforms the open source Wav2Vec 2.0 on both emotion recognition tasks despite being 7% the size.
    MRI-GAN: A Generalized Approach to Detect DeepFakes using Perceptual Image Assessment. (arXiv:2203.00108v1 [cs.CV])
    DeepFakes are synthetic videos generated by swapping a face of an original image with the face of somebody else. In this paper, we describe our work to develop general, deep learning-based models to classify DeepFake content. We propose a novel framework for using Generative Adversarial Network (GAN)-based models, we call MRI-GAN, that utilizes perceptual differences in images to detect synthesized videos. We test our MRI-GAN approach and a plain-frames-based model using the DeepFake Detection Challenge Dataset. Our plain frames-based-model achieves 91% test accuracy and a model which uses our MRI-GAN framework with Structural Similarity Index Measurement (SSIM) for the perceptual differences achieves 74% test accuracy. The results of MRI-GAN are preliminary and may be improved further by modifying the choice of loss function, tuning hyper-parameters, or by using a more advanced perceptual similarity metric.  ( 2 min )
    Structure from Voltage. (arXiv:2203.00063v1 [cs.LG])
    Effective resistance (ER) is an attractive way to interrogate the structure of graphs. It is an alternative to computing the eigen-vectors of the graph Laplacian. Graph laplacians are used to find low dimensional structures in high dimensional data. Here too, ER based analysis has advantages over eign-vector based methods. Unfortunately Von Luxburg et al. (2010) show that, when vertices correspond to a sample from a distribution over a metric space, the limit of the ER between distant points converges to a trivial quantity that holds no information about the structure of the graph. We show that by using scaling resistances in a graph with $n$ vertices by $n^2$, one gets a meaningful limit of the voltages and of effective resistances. We also show that by adding a "ground" node to a metric graph one gets a simple and natural way to compute all of the distances from a chosen point to all other points.
    Dynamic N:M Fine-grained Structured Sparse Attention Mechanism. (arXiv:2203.00091v1 [cs.LG])
    Transformers are becoming the mainstream solutions for various tasks like NLP and Computer vision. Despite their success, the high complexity of the attention mechanism hinders them from being applied to latency-sensitive tasks. Tremendous efforts have been made to alleviate this problem, and many of them successfully reduce the asymptotic complexity to linear. Nevertheless, most of them fail to achieve practical speedup over the original full attention under moderate sequence lengths and are unfriendly to finetuning. In this paper, we present DFSS, an attention mechanism that dynamically prunes the full attention weight matrix to N:M fine-grained structured sparse pattern. We provide both theoretical and empirical evidence that demonstrates DFSS is a good approximation of the full attention mechanism. We propose a dedicated CUDA kernel design that completely eliminates the dynamic pruning overhead and achieves speedups under arbitrary sequence length. We evaluate the 1:2 and 2:4 sparsity under different configurations and achieve 1.27~ 1.89x speedups over the full-attention mechanism. It only takes a couple of finetuning epochs from the pretrained model to achieve on par accuracy with full attention mechanism on tasks from various domains under different sequence lengths from 384 to 4096.
    Affordance Learning from Play for Sample-Efficient Policy Learning. (arXiv:2203.00352v1 [cs.RO])
    Robots operating in human-centered environments should have the ability to understand how objects function: what can be done with each object, where this interaction may occur, and how the object is used to achieve a goal. To this end, we propose a novel approach that extracts a self-supervised visual affordance model from human teleoperated play data and leverages it to enable efficient policy learning and motion planning. We combine model-based planning with model-free deep reinforcement learning (RL) to learn policies that favor the same object regions favored by people, while requiring minimal robot interactions with the environment. We evaluate our algorithm, Visual Affordance-guided Policy Optimization (VAPO), with both diverse simulation manipulation tasks and real world robot tidy-up experiments to demonstrate the effectiveness of our affordance-guided policies. We find that our policies train 4x faster than the baselines and generalize better to novel objects because our visual affordance model can anticipate their affordance regions.
    Machine Learning Empowered Intelligent Data Center Networking: A Survey. (arXiv:2202.13549v2 [cs.NI] UPDATED)
    To support the needs of ever-growing cloud-based services, the number of servers and network devices in data centers is increasing exponentially, which in turn results in high complexities and difficulties in network optimization. To address these challenges, both academia and industry turn to artificial intelligence technology to realize network intelligence. To this end, a considerable number of novel and creative machine learning-based (ML-based) research works have been put forward in recent few years. Nevertheless, there are still enormous challenges faced by the intelligent optimization of data center networks (DCNs), especially in the scenario of online real-time dynamic processing of massive heterogeneous services and traffic data. To best of our knowledge, there is a lack of systematic and original comprehensively investigations with in-depth analysis on intelligent DCN. To this end, in this paper, we comprehensively investigate the application of machine learning to data center networking, and provide a general overview and in-depth analysis of the recent works, covering flow prediction, flow classification, load balancing, resource management, routing optimization, and congestion control. In order to provide a multi-dimensional and multi-perspective comparison of various solutions, we design a quality assessment criteria called REBEL-3S to impartially measure the strengths and weaknesses of these research works. Moreover, we also present unique insights into the technology evolution of the fusion of data center network and machine learning, together with some challenges and potential future research opportunities.
    Multi-Task Multi-Scale Learning For Outcome Prediction in 3D PET Images. (arXiv:2203.00641v1 [eess.IV])
    Background and Objectives: Predicting patient response to treatment and survival in oncology is a prominent way towards precision medicine. To that end, radiomics was proposed as a field of study where images are used instead of invasive methods. The first step in radiomic analysis is the segmentation of the lesion. However, this task is time consuming and can be physician subjective. Automated tools based on supervised deep learning have made great progress to assist physicians. However, they are data hungry, and annotated data remains a major issue in the medical field where only a small subset of annotated images is available. Methods: In this work, we propose a multi-task learning framework to predict patient's survival and response. We show that the encoder can leverage multiple tasks to extract meaningful and powerful features that improve radiomics performance. We show also that subsidiary tasks serve as an inductive bias so that the model can better generalize. Results: Our model was tested and validated for treatment response and survival in lung and esophageal cancers, with an area under the ROC curve of 77% and 71% respectively, outperforming single task learning methods. Conclusions: We show that, by using a multi-task learning approach, we can boost the performance of radiomic analysis by extracting rich information of intratumoral and peritumoral regions.
    A Theory of Abstraction in Reinforcement Learning. (arXiv:2203.00397v1 [cs.LG])
    Reinforcement learning defines the problem facing agents that learn to make good decisions through action and observation alone. To be effective problem solvers, such agents must efficiently explore vast worlds, assign credit from delayed feedback, and generalize to new experiences, all while making use of limited data, computational resources, and perceptual bandwidth. Abstraction is essential to all of these endeavors. Through abstraction, agents can form concise models of their environment that support the many practices required of a rational, adaptive decision maker. In this dissertation, I present a theory of abstraction in reinforcement learning. I first offer three desiderata for functions that carry out the process of abstraction: they should 1) preserve representation of near-optimal behavior, 2) be learned and constructed efficiently, and 3) lower planning or learning time. I then present a suite of new algorithms and analysis that clarify how agents can learn to abstract according to these desiderata. Collectively, these results provide a partial path toward the discovery and use of abstraction that minimizes the complexity of effective reinforcement learning.
    Amortized Proximal Optimization. (arXiv:2203.00089v1 [cs.LG])
    We propose a framework for online meta-optimization of parameters that govern optimization, called Amortized Proximal Optimization (APO). We first interpret various existing neural network optimizers as approximate stochastic proximal point methods which trade off the current-batch loss with proximity terms in both function space and weight space. The idea behind APO is to amortize the minimization of the proximal point objective by meta-learning the parameters of an update rule. We show how APO can be used to adapt a learning rate or a structured preconditioning matrix. Under appropriate assumptions, APO can recover existing optimizers such as natural gradient descent and KFAC. It enjoys low computational overhead and avoids expensive and numerically sensitive operations required by some second-order optimizers, such as matrix inverses. We empirically test APO for online adaptation of learning rates and structured preconditioning matrices for regression, image reconstruction, image classification, and natural language translation tasks. Empirically, the learning rate schedules found by APO generally outperform optimal fixed learning rates and are competitive with manually tuned decay schedules. Using APO to adapt a structured preconditioning matrix generally results in optimization performance competitive with second-order methods. Moreover, the absence of matrix inversion provides numerical stability, making it effective for low precision training.
    Wavelet-Based Multi-Class Seizure Type Classification System. (arXiv:2203.00511v1 [eess.SP])
    Epilepsy is one of the most common brain diseases that affect more than 1\% of the world's population. It is characterized by recurrent seizures, which come in different types and are treated differently. Electroencephalography (EEG) is commonly used in medical services to diagnose seizures and their types. The accurate identification of seizures helps to provide optimal treatment and accurate information to the patient. However, the manual diagnostic procedures of epileptic seizures are laborious and highly-specialized. Moreover, EEG manual evaluation is a process known to have a low inter-rater agreement among experts. This paper presents a novel automatic technique that involves extraction of specific features from EEG signals using Dual-tree Complex Wavelet Transform (DTCWT) and classifying them. We evaluated the proposed technique on TUH EEG Seizure Corpus (TUSZ) ver.1.5.2 dataset and compared the performance with existing state-of-the-art techniques using overall F1-score due to class imbalance seizure types. Our proposed technique achieved the best results of weighted F1-score of 99.1\% and 74.7\% for seizure-wise and patient-wise classification respectively, thereby setting new benchmark results for this dataset.
    A Dynamic Mode Decomposition Approach for Decentralized Spectral Clustering of Graphs. (arXiv:2203.00004v1 [cs.LG])
    We propose a novel robust decentralized graph clustering algorithm that is provably equivalent to the popular spectral clustering approach. Our proposed method uses the existing wave equation clustering algorithm that is based on propagating waves through the graph. However, instead of using a fast Fourier transform (FFT) computation at every node, our proposed approach exploits the Koopman operator framework. Specifically, we show that propagating waves in the graph followed by a local dynamic mode decomposition (DMD) computation at every node is capable of retrieving the eigenvalues and the local eigenvector components of the graph Laplacian, thereby providing local cluster assignments for all nodes. We demonstrate that the DMD computation is more robust than the existing FFT based approach and requires 20 times fewer steps of the wave equation to accurately recover the clustering information and reduces the relative error by orders of magnitude. We demonstrate the decentralized approach on a range of graph clustering problems.  ( 2 min )
    Disentangled Inference for GANs with Latently Invertible Autoencoder. (arXiv:1906.08090v4 [cs.LG] UPDATED)
    Generative Adversarial Networks (GANs) play an increasingly important role in machine learning. However, there is one fundamental issue hindering their practical applications: the absence of capability for encoding real-world samples. The conventional way of addressing this issue is to learn an encoder for GAN via Variational Auto-Encoder (VAE). In this paper, we show that the entanglement of the latent space for the VAE/GAN framework poses the main challenge for encoder learning. To address the entanglement issue and enable inference in GAN we propose a novel algorithm named Latently Invertible Autoencoder (LIA). The framework of LIA is that an invertible network and its inverse mapping are symmetrically embedded in the latent space of VAE. The decoder of LIA is first trained as a standard GAN with the invertible network and then the partial encoder is learned from a disentangled autoencoder by detaching the invertible network from LIA, thus avoiding the entanglement problem caused by the random latent space. Experiments conducted on the FFHQ face dataset and three LSUN datasets validate the effectiveness of LIA/GAN.
    Neural Ordinary Differential Equations for Nonlinear System Identification. (arXiv:2203.00120v1 [cs.LG])
    Neural ordinary differential equations (NODE) have been recently proposed as a promising approach for nonlinear system identification tasks. In this work, we systematically compare their predictive performance with current state-of-the-art nonlinear and classical linear methods. In particular, we present a quantitative study comparing NODE's performance against neural state-space models and classical linear system identification methods. We evaluate the inference speed and prediction performance of each method on open-loop errors across eight different dynamical systems. The experiments show that NODEs can consistently improve the prediction accuracy by an order of magnitude compared to benchmark methods. Besides improved accuracy, we also observed that NODEs are less sensitive to hyperparameters compared to neural state-space models. On the other hand, these performance gains come with a slight increase of computation at the inference time.
    Knowledge-driven Active Learning. (arXiv:2110.08265v2 [cs.LG] UPDATED)
    In the last few years, Deep Learning models have become increasingly popular. However, their deployment is still precluded in those contexts where the amount of supervised data is limited and manual labelling expensive. Active learning strategies aim at solving this problem by requiring supervision only on few unlabelled samples, which improve the most the model performances after adding them to the training set. Most strategies are based on uncertain sample selection, and even often restricted to samples lying close to the decision boundary. Here we propose a very different approach, taking into consideration domain knowledge. Indeed, in the case of multi-label classification, the relationships among classes offer a way to spot incoherent predictions, i.e., predictions where the model may most likely need supervision. We have developed a framework where first-order-logic knowledge is converted into constraints and their violation is checked as a natural guide for sample selection. We empirically demonstrate that knowledge-driven strategy outperforms standard uncertain strategies, particularly on those datasets where the domain knowledge is rich. Furthermore, we show how the proposed approach enables discovering data distributions lying far from initial training data.
    The complexity of quantum support vector machines. (arXiv:2203.00031v1 [quant-ph])
    Quantum support vector machines employ quantum circuits to define the kernel function. It has been shown that this approach offers a provable exponential speedup compared to any known classical algorithm for certain data sets. The training of such models corresponds to solving a convex optimization problem either via its primal or dual formulation. Due to the probabilistic nature of quantum mechanics, the training algorithms are affected by statistical uncertainty, which has a major impact on their complexity. We show that the dual problem can be solved in $\mathcal{O}(M^{4.67}/\varepsilon^2)$ quantum circuit evaluations, where $M$ denotes the size of the data set and $\varepsilon$ the solution accuracy. We prove under an empirically motivated assumption that the kernelized primal problem can alternatively be solved in $\mathcal{O}(\min \{ M^2/\varepsilon^6, \, 1/\varepsilon^{10} \})$ evaluations by employing a generalization of a known classical algorithm called Pegasos. Accompanying empirical results demonstrate these analytical complexities to be essentially tight. In addition, we investigate a variational approximation to quantum support vector machines and show that their heuristic training achieves considerably better scaling in our experiments.
    Beyond Gradients: Exploiting Adversarial Priors in Model Inversion Attacks. (arXiv:2203.00481v1 [cs.LG])
    Collaborative machine learning settings like federated learning can be susceptible to adversarial interference and attacks. One class of such attacks is termed model inversion attacks, characterised by the adversary reverse-engineering the model to extract representations and thus disclose the training data. Prior implementations of this attack typically only rely on the captured data (i.e. the shared gradients) and do not exploit the data the adversary themselves control as part of the training consortium. In this work, we propose a novel model inversion framework that builds on the foundations of gradient-based model inversion attacks, but additionally relies on matching the features and the style of the reconstructed image to data that is controlled by an adversary. Our technique outperforms existing gradient-based approaches both qualitatively and quantitatively, while still maintaining the same honest-but-curious threat model, allowing the adversary to obtain enhanced reconstructions while remaining concealed.
    Graph Normalized-LMP Algorithm for Signal Estimation Under Impulsive Noise. (arXiv:2203.00320v1 [eess.SP])
    In this paper, we introduce an adaptive graph normalized least mean pth power (GNLMP) algorithm for graph signal processing (GSP) that utilizes GSP techniques, including bandlimited filtering and node sampling, to estimate sampled graph signals under impulsive noise. Different from least-squares-based algorithms, such as the adaptive GSP Least Mean Squares (GLMS) algorithm and the normalized GLMS (GNLMS) algorithm, the GNLMP algorithm has the ability to reconstruct a graph signal that is corrupted by non-Gaussian noise with heavy-tailed characteristics. Compared to the recently introduced adaptive GSP least mean pth power (GLMP) algorithm, the GNLMP algorithm reduces the number of iterations to converge to a steady graph signal. The convergence condition of the GNLMP algorithm is derived, and the ability of the GNLMP algorithm to process multidimensional time-varying graph signals with multiple features is demonstrated as well. Simulations show the performance of the GNLMP algorithm in estimating steady-state and time-varying graph signals is faster than GLMP and more robust in comparison to GLMS and GNLMS.
    Private Convex Optimization via Exponential Mechanism. (arXiv:2203.00263v1 [cs.DS])
    In this paper, we study private optimization problems for non-smooth convex functions $F(x)=\mathbb{E}_i f_i(x)$ on $\mathbb{R}^d$. We show that modifying the exponential mechanism by adding an $\ell_2^2$ regularizer to $F(x)$ and sampling from $\pi(x)\propto \exp(-k(F(x)+\mu\|x\|_2^2/2))$ recovers both the known optimal empirical risk and population loss under $(\epsilon,\delta)$-DP. Furthermore, we show how to implement this mechanism using $\widetilde{O}(n \min(d, n))$ queries to $f_i(x)$ for the DP-SCO where $n$ is the number of samples/users and $d$ is the ambient dimension. We also give a (nearly) matching lower bound $\widetilde{\Omega}(n \min(d, n))$ on the number of evaluation queries. Our results utilize the following tools that are of independent interest: (1) We prove Gaussian Differential Privacy (GDP) of the exponential mechanism if the loss function is strongly convex and the perturbation is Lipschitz. Our privacy bound is \emph{optimal} as it includes the privacy of Gaussian mechanism as a special case and is proved using the isoperimetric inequality for strongly log-concave measures. (2) We show how to sample from $\exp(-F(x)-\mu \|x\|^2_2/2)$ for $G$-Lipschitz $F$ with $\eta$ error in total variation (TV) distance using $\widetilde{O}((G^2/\mu) \log^2(d/\eta))$ unbiased queries to $F(x)$. This is the first sampler whose query complexity has \emph{polylogarithmic dependence} on both dimension $d$ and accuracy $\eta$.
    Pedagogical Demonstrations and Pragmatic Learning in Artificial Tutor-Learner Interactions. (arXiv:2203.00111v1 [cs.LG])
    When demonstrating a task, human tutors pedagogically modify their behavior by either "showing" the task rather than just "doing" it (exaggerating on relevant parts of the demonstration) or by giving demonstrations that best disambiguate the communicated goal. Analogously, human learners pragmatically infer the communicative intent of the tutor: they interpret what the tutor is trying to teach them and deduce relevant information for learning. Without such mechanisms, traditional Learning from Demonstration (LfD) algorithms will consider such demonstrations as sub-optimal. In this paper, we investigate the implementation of such mechanisms in a tutor-learner setup where both participants are artificial agents in an environment with multiple goals. Using pedagogy from the tutor and pragmatism from the learner, we show substantial improvements over standard learning from demonstrations.
    Parameter-free Mirror Descent. (arXiv:2203.00444v1 [cs.LG])
    We develop a modified online mirror descent framework that is suitable for building adaptive and parameter-free algorithms in unbounded domains. We leverage this technique to develop the first unconstrained online linear optimization algorithm achieving an optimal dynamic regret bound, and we further demonstrate that natural strategies based on Follow-the-Regularized-Leader are unable to achieve similar results. We also apply our mirror descent framework to build new parameter-free implicit updates, as well as a simplified and improved unconstrained scale-free algorithm.
    When AUC meets DRO: Optimizing Partial AUC for Deep Learning with Non-Convex Convergence Guarantee. (arXiv:2203.00176v1 [cs.LG])
    In this paper, we propose systematic and efficient gradient-based methods for both one-way and two-way partial AUC (pAUC) maximization that are applicable to deep learning. We propose new formulations of pAUC surrogate objectives by using the distributionally robust optimization (DRO) to define the loss for each individual positive data. We consider two formulations of DRO, one of which is based on conditional-value-at-risk (CVaR) that yields a non-smooth but exact estimator for pAUC, and another one is based on a KL divergence regularized DRO that yields an inexact but smooth (soft) estimator for pAUC. For both one-way and two-way pAUC maximization, we propose two algorithms and prove their convergence for optimizing their two formulations, respectively. Experiments demonstrate the effectiveness of the proposed algorithms for pAUC maximization for deep learning on various datasets.
    Differentiable Matrix Elements with MadJax. (arXiv:2203.00057v1 [hep-ph])
    MadJax is a tool for generating and evaluating differentiable matrix elements of high energy scattering processes. As such, it is a step towards a differentiable programming paradigm in high energy physics that facilitates the incorporation of high energy physics domain knowledge, encoded in simulation software, into gradient based learning and optimization pipelines. MadJax comprises two components: (a) a plugin to the general purpose matrix element generator MadGraph that integrates matrix element and phase space sampling code with the JAX differentiable programming framework, and (b) a standalone wrapping API for accessing the matrix element code and its gradients, which are computed with automatic differentiation. The MadJax implementation and example applications of simulation based inference and normalizing flow based matrix element modeling, with capabilities enabled uniquely with differentiable matrix elements, are presented.
    An Analytical Approach to Compute the Exact Preimage of Feed-Forward Neural Networks. (arXiv:2203.00438v1 [cs.LG])
    Neural networks are a convenient way to automatically fit functions that are too complex to be described by hand. The downside of this approach is that it leads to build a black-box without understanding what happened inside. Finding the preimage would help to better understand how and why such neural networks had given such outputs. Because most of the neural networks are noninjective function, it is often impossible to compute it entirely only by a numerical way. The point of this study is to give a method to compute the exact preimage of any Feed-Forward Neural Network with linear or piecewise linear activation functions for hidden layers. In contrast to other methods, this one is not returning a unique solution for a unique output but returns analytically the entire and exact preimage.
    Interfacing Finite Elements with Deep Neural Operators for Fast Multiscale Modeling of Mechanics Problems. (arXiv:2203.00003v1 [cs.CE])
    Multiscale modeling is an effective approach for investigating multiphysics systems with largely disparate size features, where models with different resolutions or heterogeneous descriptions are coupled together for predicting the system's response. The solver with lower fidelity (coarse) is responsible for simulating domains with homogeneous features, whereas the expensive high-fidelity (fine) model describes microscopic features with refined discretization, often making the overall cost prohibitively high, especially for time-dependent problems. In this work, we explore the idea of multiscale modeling with machine learning and employ DeepONet, a neural operator, as an efficient surrogate of the expensive solver. DeepONet is trained offline using data acquired from the fine solver for learning the underlying and possibly unknown fine-scale dynamics. It is then coupled with standard PDE solvers for predicting the multiscale systems with new boundary/initial conditions in the coupling stage. The proposed framework significantly reduces the computational cost of multiscale simulations since the DeepONet inference cost is negligible, facilitating readily the incorporation of a plurality of interface conditions and coupling schemes. We present various benchmarks to assess accuracy and speedup, and in particular we develop a coupling algorithm for a time-dependent problem, and we also demonstrate coupling of a continuum model (finite element methods, FEM) with a neural operator representation of a particle system (Smoothed Particle Hydrodynamics, SPH) for a uniaxial tension problem with hyperelastic material. What makes this approach unique is that a well-trained over-parametrized DeepONet can generalize well and make predictions at a negligible cost.
    A Domain-Theoretic Framework for Robustness Analysis of Neural Networks. (arXiv:2203.00295v1 [cs.LG])
    We present a domain-theoretic framework for validated robustness analysis of neural networks. We first analyze the global robustness of a general class of networks. Then, using the fact that, over finite-dimensional Banach spaces, the domain-theoretic L-derivative coincides with Clarke's generalized gradient, we extend our framework for attack-agnostic local robustness analysis. Our framework is ideal for designing algorithms which are correct by construction. We exemplify this claim by developing a validated algorithm for estimation of Lipschitz constant of feedforward regressors. We prove the completeness of the algorithm over differentiable networks, and also over general position ReLU networks. Within our domain model, differentiable and non-differentiable networks can be analyzed uniformly. We implement our algorithm using arbitrary-precision interval arithmetic, and present the results of some experiments. Our implementation is truly validated, as it handles floating-point errors as well.
    DeepNet: Scaling Transformers to 1,000 Layers. (arXiv:2203.00555v1 [cs.CL])
    In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer, accompanying with theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., good performance of Post-LN and stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.
    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction. (arXiv:2203.00101v1 [cs.SE])
    In this paper, we present ApacheJIT, a large dataset for Just-In-Time defect prediction. ApacheJIT consists of clean and bug-inducing software changes in popular Apache projects. ApacheJIT has a total of 106,674 commits (28,239 bug-inducing and 78,435 clean commits). Having a large number of commits makes ApacheJIT a suitable dataset for machine learning models, especially deep learning models that require large training sets to effectively generalize the patterns present in the historical data to future data. In addition to the original dataset, we also present carefully selected training and test sets that we recommend to be used in training and evaluating machine learning models.
  • Open

    MDA for random forests: inconsistency, and a practical solution via the Sobol-MDA. (arXiv:2102.13347v3 [stat.ML] UPDATED)
    Variable importance measures are the main tools to analyze the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the definition of MDA varies across the main random forest software. In this article, our objective is to rigorously analyze the behavior of the main MDA implementations. Consequently, we mathematically formalize the various implemented MDA algorithms, and then establish their limits when the sample size increases. This asymptotic analysis reveals that these MDA versions differ as importance measures, since they converge towards different quantities. More importantly, we break down these limits into three components: the first two terms are related to Sobol indices, which are well-defined measures of a covariate contribution to the response variance, widely used in the sensitivity analysis field, as opposed to the third term, whose value increases with dependence within covariates. Thus, we theoretically demonstrate that the MDA does not target the right quantity to detect influential covariates in a dependent setting, a fact that has already been noticed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-MDA, which fixes the flaws of the original MDA, and consistently estimates the accuracy decrease of the forest retrained without a given covariate, but with an efficient computational cost. The Sobol-MDA empirically outperforms its competitors on both simulated and real data for variable selection. An open source implementation in R and C++ is available online.
    A Probabilistic Deep Image Prior for Computational Tomography. (arXiv:2203.00479v1 [eess.IV])
    Existing deep-learning based tomographic image reconstruction methods do not provide accurate estimates of reconstruction uncertainty, hindering their real-world deployment. To address this limitation, we construct a Bayesian prior for tomographic reconstruction, which combines the classical total variation (TV) regulariser with the modern deep image prior (DIP). Specifically, we use a change of variables to connect our prior beliefs on the image TV semi-norm with the hyper-parameters of the DIP network. For the inference, we develop an approach based on the linearised Laplace method, which is scalable to high-dimensional settings. The resulting framework provides pixel-wise uncertainty estimates and a marginal likelihood objective for hyperparameter optimisation. We demonstrate the method on synthetic and real-measured high-resolution $\mu$CT data, and show that it provides superior calibration of uncertainty estimates relative to previous probabilistic formulations of the DIP.
    Side-effects of Learning from Low Dimensional Data Embedded in an Euclidean Space. (arXiv:2203.00614v1 [cs.LG])
    The low dimensional manifold hypothesis posits that the data found in many applications, such as those involving natural images, lie (approximately) on low dimensional manifolds embedded in a high dimensional Euclidean space. In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input. However, one often needs to consider evaluating the optimized network at points outside the training distribution. This paper considers the case in which the training data is distributed in a linear subspace of $\mathbb R^d$. We derive estimates on the variation of the learning function, defined by a neural network, in the direction transversal to the subspace. We study the potential regularization effects associated with the network's depth and noise in the codimension of the data manifold. We also present additional side effects in training due to the presence of noise.
    Mixed Variational Inference. (arXiv:1901.04791v4 [stat.ML] UPDATED)
    The Laplace approximation has been one of the workhorses of Bayesian inference. It often delivers good approximations in practice despite the fact that it does not strictly take into account where the volume of posterior density lies. Variational approaches avoid this issue by explicitly minimising the Kullback-Leibler divergence DKL between a postulated posterior and the true (unnormalised) logarithmic posterior. However, they rely on a closed form DKL in order to update the variational parameters. To address this, stochastic versions of variational inference have been devised that approximate the intractable DKL with a Monte Carlo average. This approximation allows calculating gradients with respect to the variational parameters. However, variational methods often postulate a factorised Gaussian approximating posterior. In doing so, they sacrifice a-posteriori correlations. In this work, we propose a method that combines the Laplace approximation with the variational approach. The advantages are that we maintain: applicability on non-conjugate models, posterior correlations and a reduced number of free variational parameters. Numerical experiments demonstrate improvement over the Laplace approximation and variational inference with factorised Gaussian posteriors.
    Machine Learning for Particle Flow Reconstruction at CMS. (arXiv:2203.00330v1 [physics.data-an])
    We provide details on the implementation of a machine-learning based particle flow algorithm for CMS. The standard particle flow algorithm reconstructs stable particles based on calorimeter clusters and tracks to provide a global event reconstruction that exploits the combined information of multiple detector subsystems, leading to strong improvements for quantities such as jets and missing transverse energy. We have studied a possible evolution of particle flow towards heterogeneous computing platforms such as GPUs using a graph neural network. The machine-learned PF model reconstructs particle candidates based on the full list of tracks and calorimeter clusters in the event. For validation, we determine the physics performance directly in the CMS software framework when the proposed algorithm is interfaced with the offline reconstruction of jets and missing transverse energy. We also report the computational performance of the algorithm, which scales approximately linearly in runtime and memory usage with the input size.  ( 2 min )
    Minibatch vs Local SGD for Heterogeneous Distributed Learning. (arXiv:2006.04735v5 [cs.LG] UPDATED)
    We analyze Local SGD (aka parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that, (i) Minibatch SGD (even without acceleration) dominates all existing analysis of Local SGD in this setting, (ii) accelerated Minibatch SGD is optimal when the heterogeneity is high, and (iii) present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.  ( 2 min )
    Weighted Training for Cross-Task Learning. (arXiv:2105.14095v2 [cs.LG] UPDATED)
    In this paper, we introduce Target-Aware Weighted Training (TAWT), a weighted training algorithm for cross-task learning based on minimizing a representation-based task distance between the source and target tasks. We show that TAWT is easy to implement, is computationally efficient, requires little hyperparameter tuning, and enjoys non-asymptotic learning-theoretic guarantees. The effectiveness of TAWT is corroborated through extensive experiments with BERT on four sequence tagging tasks in natural language processing (NLP), including part-of-speech (PoS) tagging, chunking, predicate detection, and named entity recognition (NER). As a byproduct, the proposed representation-based task distance allows one to reason in a theoretically principled way about several critical aspects of cross-task learning, such as the choice of the source data and the impact of fine-tuning.  ( 2 min )
    Continual Learning for Monolingual End-to-End Automatic Speech Recognition. (arXiv:2112.09427v2 [eess.AS] UPDATED)
    Adapting Automatic Speech Recognition (ASR) models to new domains leads to a deterioration of performance on the original domain(s), a phenomenon called Catastrophic Forgetting (CF). Even monolingual ASR models cannot be extended to new accents, dialects, topics, etc. without suffering from CF, making them unable to be continually enhanced without storing all past data. Fortunately, Continual Learning (CL) methods, which aim to enable continual adaptation while overcoming CF, can be used. In this paper, we implement an extensive number of CL methods for End-to-End ASR and test and compare their ability to extend a monolingual Hybrid CTC-Transformer model across four new tasks. We find that the best performing CL method closes the gap between the fine-tuned model (lower bound) and the model trained jointly on all tasks (upper bound) by more than 40%, while requiring access to only 0.6% of the original data.  ( 2 min )
    Parameter-free Mirror Descent. (arXiv:2203.00444v1 [cs.LG])
    We develop a modified online mirror descent framework that is suitable for building adaptive and parameter-free algorithms in unbounded domains. We leverage this technique to develop the first unconstrained online linear optimization algorithm achieving an optimal dynamic regret bound, and we further demonstrate that natural strategies based on Follow-the-Regularized-Leader are unable to achieve similar results. We also apply our mirror descent framework to build new parameter-free implicit updates, as well as a simplified and improved unconstrained scale-free algorithm.  ( 2 min )
    RAB: Provable Robustness Against Backdoor Attacks. (arXiv:2003.08904v6 [cs.LG] UPDATED)
    Recent studies have shown that deep neural networks are vulnerable to adversarial attacks, including evasion and backdoor (poisoning) attacks. On the defense side, there have been intensive efforts on improving both empirical and provable robustness against evasion attacks; however, provable robustness against backdoor attacks still remains largely unexplored. In this paper, we focus on certifying the machine learning model robustness against general threat models, especially backdoor attacks. We first provide a unified framework via randomized smoothing techniques and show how it can be instantiated to certify the robustness against both evasion and backdoor attacks. We then propose the first robust training process, RAB, to smooth the trained model and certify its robustness against backdoor attacks. We theoretically prove the robustness bound for machine learning models trained with RAB, and prove that our robustness bound is tight. We derive the robustness conditions for different smoothing distributions including Gaussian and uniform distributions. In addition, we theoretically show that it is possible to train the robust smoothed models efficiently for simple models such as K-nearest neighbor classifiers, and we propose an exact smooth-training algorithm which eliminates the need to sample from a noise distribution for such models. Empirically, we conduct comprehensive experiments for different machine learning models such as DNNs and K-NN models on MNIST, CIFAR-10, and ImageNette datasets and provide the first benchmark for certified robustness against backdoor attacks. In addition, we evaluate K-NN models on a spambase tabular dataset to demonstrate the advantages of the proposed exact algorithm. Both the theoretic analysis and the comprehensive evaluation on diverse ML models and datasets shed lights on further robust learning strategies against general training time attacks.  ( 3 min )
    The Concordance Index decomposition -- A measure for a deeper understanding of survival prediction models. (arXiv:2203.00144v1 [cs.LG])
    The Concordance Index (C-index) is a commonly used metric in Survival Analysis to evaluate how good a prediction model is. This paper proposes a decomposition of the C-Index into a weighted harmonic mean of two quantities: one for ranking observed events versus other observed events, and the other for ranking observed events versus censored cases. This decomposition allows a more fine-grained analysis of the pros and cons of survival prediction methods. The utility of the decomposition is demonstrated using three benchmark survival analysis models (Cox Proportional Hazard, Random Survival Forest, and Deep Adversarial Time-to-Event Network) together with a new variational generative neural-network-based method (SurVED), which is also proposed in this paper. The demonstration is done on four publicly available datasets with varying censoring levels. The analysis with the C-index decomposition shows that all methods essentially perform equally well when the censoring level is high because of the dominance of the term measuring the ranking of events versus censored cases. In contrast, some methods deteriorate when the censoring level decreases because they do not rank the events versus other events well.  ( 2 min )
    On the Generalization of Representations in Reinforcement Learning. (arXiv:2203.00543v1 [cs.LG])
    In reinforcement learning, state representations are used to tractably deal with large problem spaces. State representations serve both to approximate the value function with few parameters, but also to generalize to newly encountered states. Their features may be learned implicitly (as part of a neural network) or explicitly (for example, the successor representation of \citet{dayan1993improving}). While the approximation properties of representations are reasonably well-understood, a precise characterization of how and when these representations generalize is lacking. In this work, we address this gap and provide an informative bound on the generalization error arising from a specific state representation. This bound is based on the notion of effective dimension which measures the degree to which knowing the value at one state informs the value at other states. Our bound applies to any state representation and quantifies the natural tension between representations that generalize well and those that approximate well. We complement our theoretical results with an empirical survey of classic representation learning methods from the literature and results on the Arcade Learning Environment, and find that the generalization behaviour of learned representations is well-explained by their effective dimension.  ( 2 min )
    On Testability and Goodness of Fit Tests in Missing Data Models. (arXiv:2203.00132v1 [stat.ME])
    Significant progress has been made in developing identification and estimation techniques for missing data problems where modeling assumptions can be described via a directed acyclic graph. The validity of results using such techniques rely on the assumptions encoded by the graph holding true; however, verification of these assumptions has not received sufficient attention in prior work. In this paper, we provide new insights on the testable implications of three broad classes of missing data graphical models, and design goodness-of-fit tests around them. The classes of models explored are: sequential missing-at-random and missing-not-at-random models which can be used for modeling longitudinal studies with dropout/censoring, and a kind of no self-censoring model which can be applied to cross-sectional studies and surveys.  ( 2 min )
    Neural Score Matching for High-Dimensional Causal Inference. (arXiv:2203.00554v1 [stat.ML])
    Traditional methods for matching in causal inference are impractical for high-dimensional datasets. They suffer from the curse of dimensionality: exact matching and coarsened exact matching find exponentially fewer matches as the input dimension grows, and propensity score matching may match highly unrelated units together. To overcome this problem, we develop theoretical results which motivate the use of neural networks to obtain non-trivial, multivariate balancing scores of a chosen level of coarseness, in contrast to the classical, scalar propensity score. We leverage these balancing scores to perform matching for high-dimensional causal inference and call this procedure neural score matching. We show that our method is competitive against other matching approaches on semi-synthetic high-dimensional datasets, both in terms of treatment effect estimation and reducing imbalance.  ( 2 min )
    MM for Penalized Estimation. (arXiv:1912.11119v3 [stat.CO] UPDATED)
    Penalized estimation can conduct variable selection and parameter estimation simultaneously. The general framework is to minimize a loss function subject to a penalty designed to generate sparse variable selection. The majorization-minimization (MM) algorithm is a computational scheme for stability and simplicity, and the MM algorithm has been widely applied in penalized estimation. Much of the previous work have focused on convex loss functions such as generalized linear models. When data are contaminated with outliers, robust loss functions can generate more reliable estimates. Recent literature has witnessed a growing impact of nonconvex loss-based methods, which can generate robust estimation for data contaminated with outliers. This article investigates MM algorithm for penalized estimation, provide innovative optimality conditions and establish convergence theory with both convex and nonconvex loss functions. With respect to applications, we focus on several nonconvex loss functions, which were formerly studied in machine learning for regression and classification problems. Performance of the proposed algorithms are evaluated on simulated and real data including healthcare costs and cancer clinical status. Efficient implementations of the algorithms are available in the R package mpath in CRAN.  ( 2 min )
    TCMI: a non-parametric mutual-dependence estimator for multivariate continuous distributions. (arXiv:2001.11212v2 [stat.ML] UPDATED)
    The identification of relevant features, i.e., the driving variables that determine a process or the properties of a system, is an essential part of the analysis of data sets with a large number of variables. A mathematical rigorous approach to quantifying the relevance of these features is mutual information. Mutual information determines the relevance of features in terms of their joint mutual dependence to the property of interest. However, mutual information requires as input probability distributions, which cannot be reliably estimated from continuous distributions such as physical quantities like lengths or energies. Here, we introduce total cumulative mutual information (TCMI), a measure of the relevance of mutual dependences that extends mutual information to random variables of continuous distribution based on cumulative probability distributions. TCMI is a non-parametric, robust, and deterministic measure that facilitates comparisons and rankings between feature sets with different cardinality. The ranking induced by TCMI allows for feature selection, i.e., the identification of variable sets that are nonlinear statistically related to a property of interest, taking into account the number of data samples as well as the cardinality of the set of variables. We evaluate the performance of our measure with simulated data, compare its performance with similar multivariate-dependence measures, and demonstrate the effectiveness of our feature-selection method on a set of standard data sets and a typical scenario in materials science.  ( 2 min )
    Making use of supercomputers in financial machine learning. (arXiv:2203.00427v1 [cs.DC])
    This article is the result of a collaboration between Fujitsu and Advestis. This collaboration aims at refactoring and running an algorithm based on systematic exploration producing investment recommendations on a high-performance computer of the Fugaku, to see whether a very high number of cores could allow for a deeper exploration of the data compared to a cloud machine, hopefully resulting in better predictions. We found that an increase in the number of explored rules results in a net increase in the predictive performance of the final ruleset. Also, in the particular case of this study, we found that using more than around 40 cores does not bring a significant computation time gain. However, the origin of this limitation is explained by a threshold-based search heuristic used to prune the search space. We have evidence that for similar data sets with less restrictive thresholds, the number of cores actually used could very well be much higher, allowing parallelization to have a much greater effect.  ( 2 min )
    Causal Structure Learning with Greedy Unconditional Equivalence Search. (arXiv:2203.00521v1 [stat.ML])
    We consider the problem of characterizing directed acyclic graph (DAG) models up to unconditional equivalence, i.e., when two DAGs have the same set of unconditional d-separation statements. Each unconditional equivalence class (UEC) can be uniquely represented with an undirected graph whose clique structure encodes the members of the class. Via this structure, we provide a transformational characterization of unconditional equivalence. Combining these results, we introduce a hybrid algorithm for learning DAG models from observational data, called Greedy Unconditional Equivalence Search (GUES), which first estimates the UEC of the data using independence tests and then greedily searches the UEC for the optimal DAG. Applying GUES on synthetic data, we show that it achieves comparable accuracy to existing methods. However, in contrast to existing methods, since the average UEC is observed to contain few DAGs, the search space for GUES is drastically reduced.  ( 2 min )
    Pairwise Supervision Can Provably Elicit a Decision Boundary. (arXiv:2006.06207v2 [stat.ML] UPDATED)
    Similarity learning is a general problem to elicit useful representations by predicting the relationship between a pair of patterns. This problem is related to various important preprocessing tasks such as metric learning, kernel learning, and contrastive learning. A classifier built upon the representations is expected to perform well in downstream classification; however, little theory has been given in literature so far and thereby the relationship between similarity and classification has remained elusive. Therefore, we tackle a fundamental question: can similarity information provably leads a model to perform well in downstream classification? In this paper, we reveal that a product-type formulation of similarity learning is strongly related to an objective of binary classification. We further show that these two different problems are explicitly connected by an excess risk bound. Consequently, our results elucidate that similarity learning is capable of solving binary classification by directly eliciting a decision boundary.  ( 2 min )
    Riemannian statistics meets random matrix theory: towards learning from high-dimensional covariance matrices. (arXiv:2203.00204v1 [math.ST])
    Riemannian Gaussian distributions were initially introduced as basic building blocks for learning models which aim to capture the intrinsic structure of statistical populations of positive-definite matrices (here called covariance matrices). While the potential applications of such models have attracted significant attention, a major obstacle still stands in the way of these applications: there seems to exist no practical method of computing the normalising factors associated with Riemannian Gaussian distributions on spaces of high-dimensional covariance matrices. The present paper shows that this missing method comes from an unexpected new connection with random matrix theory. Its main contribution is to prove that Riemannian Gaussian distributions of real, complex, or quaternion covariance matrices are equivalent to orthogonal, unitary, or symplectic log-normal matrix ensembles. This equivalence yields a highly efficient approximation of the normalising factors, in terms of a rather simple analytic expression. The error due to this approximation decreases like the inverse square of dimension. Numerical experiments are conducted which demonstrate how this new approximation can unlock the difficulties which have impeded applications to real-world datasets of high-dimensional covariance matrices. The paper then turns to Riemannian Gaussian distributions of block-Toeplitz covariance matrices. These are equivalent to yet another kind of random matrix ensembles, here called "acosh-normal" ensembles. Orthogonal and unitary "acosh-normal" ensembles correspond to the cases of block-Toeplitz with Toeplitz blocks, and block-Toeplitz (with general blocks) covariance matrices, respectively.  ( 2 min )
    Provable Lifelong Learning of Representations. (arXiv:2110.14098v2 [cs.LG] UPDATED)
    In lifelong learning, tasks (or classes) to be learned arrive sequentially over time in arbitrary order. During training, knowledge from previous tasks can be captured and transferred to subsequent ones to improve sample efficiency. We consider the setting where all target tasks can be represented in the span of a small number of unknown linear or nonlinear features of the input data. We propose a lifelong learning algorithm that maintains and refines the internal feature representation. We prove that for any desired accuracy on all tasks, the dimension of the representation remains close to that of the underlying representation. The resulting sample complexity improves significantly on existing bounds. In the setting of linear features, our algorithm is provably efficient and the sample complexity for input dimension $d$, $m$ tasks with $k$ features up to error $\epsilon$ is $\tilde{O}(dk^{1.5}/\epsilon+km/\epsilon)$. We also prove a matching lower bound for any lifelong learning algorithm that uses a single task learner as a black box. We complement our analysis with an empirical study, including a heuristic lifelong learning algorithm for deep neural networks. Our method performs favorably on challenging realistic image datasets compared to state-of-the-art continual learning methods.  ( 2 min )
    Last Layer Marginal Likelihood for Invariance Learning. (arXiv:2106.07512v2 [stat.ML] UPDATED)
    Data augmentation is often used to incorporate inductive biases into models. Traditionally, these are hand-crafted and tuned with cross validation. The Bayesian paradigm for model selection provides a path towards end-to-end learning of invariances using only the training data, by optimising the marginal likelihood. Computing the marginal likelihood is hard for neural networks, but success with tractable approaches that compute the marginal likelihood for the last layer only raises the question of whether this convenient approach might be employed for learning invariances. We show partial success on standard benchmarks, in the low-data regime and on a medical imaging dataset by designing a custom optimisation routine. Introducing a new lower bound to the marginal likelihood allows us to perform inference for a larger class of likelihood functions than before. On the other hand, we demonstrate failure modes on the CIFAR10 dataset, where the last layer approximation is not sufficient due to the increased complexity of our neural network. Our results indicate that once more sophisticated approximations become available the marginal likelihood is a promising approach for invariance learning in neural networks.  ( 2 min )
    When AUC meets DRO: Optimizing Partial AUC for Deep Learning with Non-Convex Convergence Guarantee. (arXiv:2203.00176v1 [cs.LG])
    In this paper, we propose systematic and efficient gradient-based methods for both one-way and two-way partial AUC (pAUC) maximization that are applicable to deep learning. We propose new formulations of pAUC surrogate objectives by using the distributionally robust optimization (DRO) to define the loss for each individual positive data. We consider two formulations of DRO, one of which is based on conditional-value-at-risk (CVaR) that yields a non-smooth but exact estimator for pAUC, and another one is based on a KL divergence regularized DRO that yields an inexact but smooth (soft) estimator for pAUC. For both one-way and two-way pAUC maximization, we propose two algorithms and prove their convergence for optimizing their two formulations, respectively. Experiments demonstrate the effectiveness of the proposed algorithms for pAUC maximization for deep learning on various datasets.  ( 2 min )
    Robust Multi-Agent Bandits Over Undirected Graphs. (arXiv:2203.00076v1 [cs.LG])
    We consider a multi-agent multi-armed bandit setting in which $n$ honest agents collaborate over a network to minimize regret but $m$ malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur $O( (m + K/n) \log (T) / \Delta )$ regret in this setting, where $K$ is the number of arms and $\Delta$ is the arm gap. For $m \ll K$, this improves over the single-agent baseline regret of $O(K\log(T)/\Delta)$. In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in $K$ and $n$. In light of this negative result, we propose a new algorithm for which the $i$-th agent has regret $O( ( d_{\text{mal}}(i) + K/n) \log(T)/\Delta)$ on any connected and undirected graph, where $d_{\text{mal}}(i)$ is the number of $i$'s neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where $d_{\text{mal}}(i) = m$), and show the effect of malicious agents is entirely local (in the sense that only the $d_{\text{mal}}(i)$ malicious agents directly connected to $i$ affect its long-term regret).  ( 2 min )
    Amortized Proximal Optimization. (arXiv:2203.00089v1 [cs.LG])
    We propose a framework for online meta-optimization of parameters that govern optimization, called Amortized Proximal Optimization (APO). We first interpret various existing neural network optimizers as approximate stochastic proximal point methods which trade off the current-batch loss with proximity terms in both function space and weight space. The idea behind APO is to amortize the minimization of the proximal point objective by meta-learning the parameters of an update rule. We show how APO can be used to adapt a learning rate or a structured preconditioning matrix. Under appropriate assumptions, APO can recover existing optimizers such as natural gradient descent and KFAC. It enjoys low computational overhead and avoids expensive and numerically sensitive operations required by some second-order optimizers, such as matrix inverses. We empirically test APO for online adaptation of learning rates and structured preconditioning matrices for regression, image reconstruction, image classification, and natural language translation tasks. Empirically, the learning rate schedules found by APO generally outperform optimal fixed learning rates and are competitive with manually tuned decay schedules. Using APO to adapt a structured preconditioning matrix generally results in optimization performance competitive with second-order methods. Moreover, the absence of matrix inversion provides numerical stability, making it effective for low precision training.  ( 2 min )
    Disentangled Inference for GANs with Latently Invertible Autoencoder. (arXiv:1906.08090v4 [cs.LG] UPDATED)
    Generative Adversarial Networks (GANs) play an increasingly important role in machine learning. However, there is one fundamental issue hindering their practical applications: the absence of capability for encoding real-world samples. The conventional way of addressing this issue is to learn an encoder for GAN via Variational Auto-Encoder (VAE). In this paper, we show that the entanglement of the latent space for the VAE/GAN framework poses the main challenge for encoder learning. To address the entanglement issue and enable inference in GAN we propose a novel algorithm named Latently Invertible Autoencoder (LIA). The framework of LIA is that an invertible network and its inverse mapping are symmetrically embedded in the latent space of VAE. The decoder of LIA is first trained as a standard GAN with the invertible network and then the partial encoder is learned from a disentangled autoencoder by detaching the invertible network from LIA, thus avoiding the entanglement problem caused by the random latent space. Experiments conducted on the FFHQ face dataset and three LSUN datasets validate the effectiveness of LIA/GAN.  ( 2 min )
    Variational Autoencoders Without the Variation. (arXiv:2203.00645v1 [cs.LG])
    Variational autoencdoers (VAE) are a popular approach to generative modelling. However, exploiting the capabilities of VAEs in practice can be difficult. Recent work on regularised and entropic autoencoders have begun to explore the potential, for generative modelling, of removing the variational approach and returning to the classic deterministic autoencoder (DAE) with additional novel regularisation methods. In this paper we empirically explore the capability of DAEs for image generation without additional novel methods and the effect of the implicit regularisation and smoothness of large networks. We find that DAEs can be used successfully for image generation without additional loss terms, and that many of the useful properties of VAEs can arise implicitly from sufficiently large convolutional encoders and decoders when trained on CIFAR-10 and CelebA.  ( 2 min )
    Learning Low-Dimensional Nonlinear Structures from High-Dimensional Noisy Data: An Integral Operator Approach. (arXiv:2203.00126v1 [stat.ML])
    We propose a kernel-spectral embedding algorithm for learning low-dimensional nonlinear structures from high-dimensional and noisy observations, where the datasets are assumed to be sampled from an intrinsically low-dimensional manifold and corrupted by high-dimensional noise. The algorithm employs an adaptive bandwidth selection procedure which does not rely on prior knowledge of the underlying manifold. The obtained low-dimensional embeddings can be further utilized for downstream purposes such as data visualization, clustering and prediction. Our method is theoretically justified and practically interpretable. Specifically, we establish the convergence of the final embeddings to their noiseless counterparts when the dimension and size of the samples are comparably large, and characterize the effect of the signal-to-noise ratio on the rate of convergence and phase transition. We also prove convergence of the embeddings to the eigenfunctions of an integral operator defined by the kernel map of some reproducing kernel Hilbert space capturing the underlying nonlinear structures. Numerical simulations and analysis of three real datasets show the superior empirical performance of the proposed method, compared to many existing methods, on learning various manifolds in diverse applications.  ( 2 min )
    FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control. (arXiv:2201.10936v3 [cs.SD] UPDATED)
    Generating music with deep neural networks has been an area of active research in recent years. While the quality of generated samples has been steadily increasing, most methods are only able to exert minimal control over the generated sequence, if any. We propose the self-supervised description-to-sequence task, which allows for fine-grained controllable generation on a global level. We do so by extracting high-level features about the target sequence and learning the conditional distribution of sequences given the corresponding high-level description in a sequence-to-sequence modelling setup. We train FIGARO (FIne-grained music Generation via Attention-based, RObust control) by applying description-to-sequence modelling to symbolic music. By combining learned high level features with domain knowledge, which acts as a strong inductive bias, the model achieves state-of-the-art results in controllable symbolic music generation and generalizes well beyond the training distribution.  ( 2 min )
    E-LMC: Extended Linear Model of Coregionalization for Predictions of Spatial Fields. (arXiv:2203.00525v1 [cs.LG])
    Physical simulations based on partial differential equations typically generate spatial fields results, which are utilized to calculate specific properties of a system for engineering design and optimization. Due to the intensive computational burden of the simulations, a surrogate model mapping the low-dimensional inputs to the spatial fields are commonly built based on a relatively small dataset. To resolve the challenge of predicting the whole spatial field, the popular linear model of coregionalization (LMC) can disentangle complicated correlations within the high-dimensional spatial field outputs and deliver accurate predictions. However, LMC fails if the spatial field cannot be well approximated by a linear combination of base functions with latent processes. In this paper, we extend LMC by introducing an invertible neural network to linearize the highly complex and nonlinear spatial fields such that the LMC can easily generalize to nonlinear problems while preserving the traceability and scalability. Several real-world applications demonstrate that E-LMC can exploit spatial correlations effectively, showing a maximum improvement of about 40% over the original LMC and outperforming the other state-of-the-art spatial field models.  ( 2 min )
    Sample Complexity versus Depth: An Information Theoretic Analysis. (arXiv:2203.00246v1 [cs.LG])
    Deep learning has proven effective across a range of data sets. In light of this, a natural inquiry is: "for what data generating processes can deep learning succeed?" In this work, we study the sample complexity of learning multilayer data generating processes of a sort for which deep neural networks seem to be suited. We develop general and elegant information-theoretic tools that accommodate analysis of any data generating process -- shallow or deep, parametric or nonparametric, noiseless or noisy. We then use these tools to characterize the dependence of sample complexity on the depth of multilayer processes. Our results indicate roughly linear dependence on depth. This is in contrast to previous results that suggest exponential or high-order polynomial dependence.  ( 2 min )
    An Explainable AI System for the Diagnosis of High Dimensional Biomedical Data. (arXiv:2107.01820v2 [cs.LG] UPDATED)
    Typical state of the art flow cytometry data samples consists of measures of more than 100.000 cells in 10 or more features. AI systems are able to diagnose such data with almost the same accuracy as human experts. However, there is one central challenge in such systems: their decisions have far-reaching consequences for the health and life of people, and therefore, the decisions of AI systems need to be understandable and justifiable by humans. In this work, we present a novel explainable AI method, called ALPODS, which is able to classify (diagnose) cases based on clusters, i.e., subpopulations, in the high-dimensional data. ALPODS is able to explain its decisions in a form that is understandable for human experts. For the identified subpopulations, fuzzy reasoning rules expressed in the typical language of domain experts are generated. A visualization method based on these rules allows human experts to understand the reasoning used by the AI system. A comparison to a selection of state of the art explainable AI systems shows that ALPODS operates efficiently on known benchmark data and also on everyday routine case data.  ( 2 min )
    Contrasting random and learned features in deep Bayesian linear regression. (arXiv:2203.00573v1 [cs.LG])
    Understanding how feature learning affects generalization is among the foremost goals of modern deep learning theory. Here, we study how the ability to learn representations affects the generalization performance of a simple class of models: deep Bayesian linear neural networks trained on unstructured Gaussian data. By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch. We show that both models display sample-wise double-descent behavior in the presence of label noise. Random feature models can also display model-wise double-descent if there are narrow bottleneck layers, while deep networks do not show these divergences. Random feature models can have particular widths that are optimal for generalization at a given data density, while making neural networks as wide or as narrow as possible is always optimal. Moreover, we show that the leading-order correction to the kernel-limit learning curve cannot distinguish between random feature models and deep networks in which all layers are trained. Taken together, our findings begin to elucidate how architectural details affect generalization performance in this simple class of deep regression models.  ( 2 min )
    Information Field Theory and Artificial Intelligence. (arXiv:2112.10133v2 [stat.ML] UPDATED)
    Information field theory (IFT), the information theory for fields, is a mathematical framework for signal reconstruction and non-parametric inverse problems. Artificial intelligence (AI) and machine learning (ML) aim at generating intelligent systems including such for perception, cognition, and learning. This overlaps with IFT, which is designed to address perception, reasoning, and inference tasks. Here, the relation between concepts and tools in IFT and those in AI and ML research are discussed. In the context of IFT, fields denote physical quantities that change continuously as a function of space (and time) and information theory refers to Bayesian probabilistic logic equipped with the associated entropic information measures. Reconstructing a signal with IFT is a computational problem similar to training a generative neural network (GNN) in ML. In this paper, the process of inference in IFT is reformulated in terms of GNN training. In contrast to classical neural networks, IFT based GNNs can operate without pre-training thanks to incorporating expert knowledge into their architecture. Furthermore, the cross-fertilization of variational inference methods used in IFT and ML are discussed. These discussions suggests that IFT is well suited to address many problems in AI and ML research and application.  ( 2 min )

  • Open

    [D] Apple AI/ML Residency 2022
    To any current residents or ex residents at Apple - I was curious as to on whether you've had an overall positive experience from it - e.g. did you find that the program was actually well organized to gear you towards research/publishing or if it was the case that you were left on your own to find projects, contact people. Any other sort of feedback will also be greatly appreciated. Also, to anyone else who applied this year: have you heard back the results? submitted by /u/SnooSuggestions6394 [link] [comments]  ( 1 min )
    [D] AlphaCode - Interview with the authors (video)
    https://youtu.be/C5sWbYwzKyg I'm very happy to present this interview with Rémi Leblond and Peter Choy, two co-creators of AlphaCode! Learn the story behind the paper, whether these models actually understand the code they write, and what the future of AI code systems has in store. OUTLINE: 0:00 - Intro 1:10 - Media Reception 5:10 - How did the project go from start to finish? 9:15 - Does the model understand its own code? 14:45 - Are there plans to reduce the number of samples? 16:15 - Could one do smarter filtering of samples? 18:55 - How crucial are the public test cases? 21:55 - Could we imagine an adversarial method? 24:45 - How are coding problems even made? 27:40 - Does AlphaCode evaluate a solution's asymptotic complexity? 33:15 - Are our sampling procedures inappropriate for diversity? 36:30 - Are all generated solutions as instructive as the example? 41:30 - How are synthetic examples created during training? 42:30 - What were high and low points during this research? 45:25 - What was the most valid criticism after publication? 47:40 - What are applications in the real world? 51:00 - Where do we go from here? ​ Paper: https://storage.googleapis.com/deepmind-media/AlphaCode/competition_level_code_generation_with_alphacode.pdf Code: https://github.com/deepmind/code_contests Paper review video here: https://youtu.be/s9UAOmyah1A submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [P] Generate photos of outdoor deck restoration
    I have a heap of before and after photos of deck restorations - timber decking that looks old and weathered that we restore and it looks like new again. I was thinking that I could train a model using the before and after photos so that new clients could generate a fictitious "after" photo to help win them over. I'm reasonable with Python, C/C++ but not done any ML yet. Just wondering if you folks could point me in the right direction in terms of Python libraries / models / research that might be useful in this project. Photos at lovethedeck.com.au to see the dataset we're working with. submitted by /u/cmskipsey [link] [comments]  ( 1 min )
    [D] Help Ukraine by countering deepfakes advice/help needed
    Hi, I was reading about the potential havoc Russia could cause in Ukraine by deepfaking Zelazny surendering or giving heinous orders. I thought we could preempt that by creating a deepfake of Vladimir Putin explaining how much he enjoys fucking animals which would end with a warning about deepfakes. If it went viral a lot of people might get inoculated in a way against the damage Russia could cause with this. If you can't find the courage to post it yourself, send it over and I can do it, I'm not afraid of that little fuck. And to clarify I don't have the skillset to do this myself, that's why I'm asking for advice and help here. submitted by /u/Krad23 [link] [comments]  ( 1 min )
    [N] Meta FAIR released and improved version of their Self Supervised CV model
    This is a beefed version of SEER which was released a year ago, scaled from 1B to 10B parameters, showing improved generalization on different tasks. Also, the self supervised learning (SSL) allowed for a better coverage of the world , thus reducing bias from labelled datasets mostly originating from specific countries (e.g. US). Results look very nice. More details in their post. submitted by /u/k-folder [link] [comments]  ( 1 min )
    [D] W&B vs. Neptune vs. Comet (2022)
    What is the state/opinion of the field on these three competing services as of the present? I feel there are significant overlaps but how is the user experience. (I have used W&B and it seems okay'ish - just an extension of Tensorboard). Interested to hear community's thoughts submitted by /u/mlbloke [link] [comments]  ( 1 min )
    [D] SSL-Dimensionality Collapse
    I am reading one the recent papers by Yan https://arxiv.org/abs/2110.09348 about dimensionality reduction collapse in Self Supervised learning methods. I am having difficulty in understanding the complete and dimensional collapse. Please can someone help me in understanding these concepts with some examples or if you can share some useful resources? Thanks submitted by /u/MushiML [link] [comments]  ( 2 min )
    [D] What's your favorite unpopular/forgotten Machine Learning method?
    It seems there's a lot of attention (ha ha) on developing the most promising methods/models in Machine Learning, but there are a lot of less popular methods that fly under the radar or die out. I want to learn more about the nooks-and-crannies of ML techniques, so in this spirit I have a few questions for discussion! What's your favorite unpopular Machine Learning method? Are there any methods that you think died out before they reached their full potential? Are there any uncommon methods you know of that are really good at a very niche task? More generally, do you think there is a lack of creativity in ML right now with respect to big-picture thinking? I.e. everyone is too focused on improving current models to publish something (publish or perish) at the cost of unfound paradigm shifts? I don't really know where this discussion could go, just wanted to see what everyone had to say :) submitted by /u/SleekEagle [link] [comments]  ( 5 min )
    [D] MCDropout and CNNs
    Hello, I'm working on implementing MCDropout on a model of mine, in order to obtain an estimation of uncertainty on the predictions. This is particularly important, as I'm dealing with a regression problem. I'm referring to the method proposed by Gal Y. & Ghahramani Z., 2016, Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. If I understand correctly, it is basically necessary to include a dropout layer after every layer, and then leave it active also during inferece. So far so good, easy to implement. My doubts arise from the fact that I'm working with a CNN (in particular a ResNet). As far as I know, dropout is not used in convolution layers, and if required something like SpatialDropout2D (tensorflow) or Dropout2D (pytorch) is usually employed. I wonder if for the convolutional part of the neural network it would be better to do. Should I use a regular dropout layer, as the paper proves that the regular dropout leads to a network equivalent to a Bayesian Neural Netowork in a Variational Inference approach? Or does the proof work also for a 2D dropout, and in that case should I use such a dropout for the convolutional part of the NN? Another question. How do I evaluate the performance of the neural network as a function of the dropout rate? If I were to use a regular neural network, I would be able to test the NN on a validation dataset and find an optimal dropout rate by minimizing whichever criterion I choose. In this case, since the prediction of the neural network has an uncertainty estimation, how do you estimate which is the best model? I would imagine the uncertainty should play a role too. I found a piece of code by Gal, who defines a log likelihood, and they use that to determine the best model: https://github.com/yaringal/DropoutUncertaintyExps/blob/master/net/net.py. Is that somewhat the standard approach? Thank you very much for your help! submitted by /u/TrPhantom8 [link] [comments]  ( 3 min )
    [Project] Doing multitouch attribution in marketing analytics using shapley values
    Hi folks, we recently implemented multi-touch attribution using shapley and markov chain values at our org, and I wrote a blog post about how we implemented it using a mix of tools (primarily dbt, sagemaker, and our internal tools). I am sharing it here hoping people might find it interesting. Do let me know if you have any questions/feedback/suggestions. https://www.rudderstack.com/blog/from-first-touch-to-multi-touch-attribution-with-rudderstack-dbt-and-sagemaker/ submitted by /u/dileep31 [link] [comments]  ( 1 min )
    [R] PolyCoder 2.7BN LLM - open source model and parameters {CMU}
    submitted by /u/yazriel0 [link] [comments]  ( 1 min )
    [D] Tackling propaganda and misinformation with ML
    I know that certain social networks are likely already using ML/Statistics to identify "fake news" and add a warning sign next to them. However, that depends on the network to implement it. I was wondering if there are any attempts at automatically identifying misinformation across publicly accessible platforms (e.g. reddit comments section, Twitter replies, 9GAG comments, discussion boards of newspapers), or even providing automatic replies with fact-based replies/corrections on top. Now this strikes me as a pretty obvious idea, however I haven't heard about anything in this direction yet, and feel like this could actually be one of the ML areas which may create a wide-range public benefit. submitted by /u/Zahlii [link] [comments]  ( 9 min )
    [D] Locating consistent marker in images
    Hello, I am currently working on a project where I need to locate a marker (it's up to me to choose the marker) on a plate with food. Here are (shitty food porn) examples for how the input images might look like: Plate with marker (black) 1 Plate with marker (black) 2 So in the examples, the black marker would need to get localized (e.g. bounding box or the central point or whatever). The following assumptions can be made: The same marker and plate get used across all images and the marker is always visible. Things that can change between images: Angle from which the plate gets looked at, distance from which the plate gets looked at, lighting conditions, the food, etc. ​ So my question is: Does anyone have good ideas for how to localize the marker (as mentioned at the start, I can pick anything as a marker)? I thought of maybe searching for a labeled image dataset that shows the same object from different conditions, print that object on the plate as the marker and use a model trained on said dataset as the localizer, but would like to hear about other options. Also, ofc non-ML methods are welcome! Thank you in advance :) submitted by /u/DontDoMethButMath [link] [comments]  ( 1 min )
    [R] pytorch library for audio/speech domain adaptation?
    Are there any pytorch libraries to do benchmarking of domain adaptation methods for audio/speech tasks? Something like the Transfer Learning Library (https://github.com/thuml/Transfer-Learning-Library/) for images. submitted by /u/kiran__chari [link] [comments]
    [D] Waabi High Fidelity Simulation
    I would like to know your opinions about Waabi world for high fidelity simulation, what do you think of the following release https://youtu.be/M7xIn2iNytE submitted by /u/meldiwin [link] [comments]
    [D] What are State-of-the-Art decoding method in long text generation?
    Or are there any 2022 Survey of decoding method in long text generation(more than 512 tokens)? I just found this 2021 Survey: Decoding Methods in Neural Language Generation: A Survey , it mostly about Top-p/Top-k/beam-search, and https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/#decoding-strategies, but it just about methods in 2020 and before. This latest method seems interesting: Typical Decoding for Natural Language Generation, but i can't reproduce the result in the paper, when test this method, the generated text is worse than old methods like Top-p... There should be some more advanced more better decoding method for long text generation than Top-p/Top-k/beam-search, i think the decoding method is too important for long text generation, different method or different decoding hyperparams can result very different quality texts by the same model especially for long text generation. Large Models like GPT-2 read articles/books more than every single human's whole life, and they know much more knowledge/patterns about the text than every single human, so if we have much better decoding method, we maybe can no need to redesign/retrain/finetune the large models, just like prompt-learning NLG, maybe we can do decoding-centric NLG, it is much cheap than redesign/retrain/finetune large models. So can you suggest some better methods than Top-p/Top-k/beam-search? or do you find some very good hyperparams settings in Top-p/Top-k/beam-search methods for long text generation to maximize the quality of text fluency/coherence/relevance/diversity in GPT-2 size like models? (For large scale model GPT-3, maybe it can generate long text(2048 tokens) with incredible quality with Top-p/Top-k/beam-search, but it is a too big model to study for many researchers.) submitted by /u/ghosthamlet [link] [comments]  ( 1 min )
    [D] Important question for all data scientists/machine learning developers. How are you verifying whether a model is impacting KPIs and actually providing business/consumer value (i.e. the actual impact of the ML efforts)?
    Basically the title. As of 2022, how are people measuring the direct impacts of their machine learning models on KPIs? From my experience, I too often build and deploy the model and assume it works. What's the current way you guys are running experiments/validating and de-validating models based on KPIs/and iterating? submitted by /u/Adi-Dewan [link] [comments]  ( 1 min )
  • Open

    Implementing ONNX models in Rails
    submitted by /u/Kagermanov [link] [comments]
    Yuval Noah Harari: "All hopes of stopping the AI arms race will go up in smoke if this war continues."
    submitted by /u/justine01923 [link] [comments]
    Researchers at UC Berkeley Introduce a New Competence-Based Algorithm Called Contrastive Intrinsic Control (CIC) For Unsupervised Skill Discovery
    In the presence of extrinsic rewards, Deep Reinforcement Learning (RL) is a strong strategy for tackling complex control tasks. Playing video games with pixels, mastering the game of Go, robotic mobility, and dexterous manipulation policies are all examples of successful applications. While effective, the above advancements resulted in agents that were unable to generalize to new downstream tasks other than the one for which they were trained. Humans and animals, on the other hand, can learn skills and apply them to a range of downstream activities with little supervision. In a recent paper, UC Berkeley researchers aim to teach agents with generalization capabilities by efficiently adapting their skills to downstream tasks. Continue reading my summary on this paper Paper | Github https://preview.redd.it/79tv0v03d0l81.png?width=1370&format=png&auto=webp&s=fb04eddc7760cd76ce75d13186dcf0cb54d55af3 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Shachindra Kumar, Founder & CEO at Lazarus Network, joins Yanni Collins in the latest AIBC interview!
    submitted by /u/thedyezwfl [link] [comments]
    Why AI Isn’t Providing Better Product Recommendations
    submitted by /u/DaveBowman1975 [link] [comments]
    The best library for recognizing rectangular objects of the same size?
    Hello everyone, I need to train an AI to recognize a rectangle object that is always the same size but it's colors and texture can change. I am planning to train it using DLib. Do you guys think this is the right choice? Should I try opencv? I think it would be impossible to train Google's MediaPipe for this. Am I right? In short, which library/algorithm do you guys think I should use to train AI to recognize a rectangular object that can be any color (or mix of colors/stripes/texts/etc) Thanks a lot! submitted by /u/RangerHere [link] [comments]  ( 1 min )
    Learning or working with AI? Come join us, we are a Discord Community with close to 22'000 members! Ask questions, find teammates, share your projects, participate in events and projects, and much more!
    Programming is way more fun when you learn/work with someone. Help each other, ask questions, brainstorm, etc. There is just so much benefit to joining a community when you are in this field, especially when you cannot find the question you are looking for on stack overflow! 😉 This is the same thing with AI, and it is why, nearly two years ago now, we created a Discord server where anyone learning or working in the field could come and share their projects, learn together, work together, and much more. The community has now over 22'000 members, which is just unbelievable! We are so glad to see it growing and especially to see everyone so active. We would love for anyone to join and exchange with us, especially if you are willing to give some of your precious time to share your knowledge and help other people. We have special events and projects for the community as well as cool offers and giveaways, such as an NVIDIA RTX 3080Ti giveaway running right now in collaboration with NVIDIA for the GTC event for the community members! (check out the #announcement channel for more information about this ;) ) Come join us if you are in the field of AI ! https://discord.gg/learnaitogether submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    LinkedIn Showcases Real-Time Features for Near Real-Time Personalization
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Here is another cool AI Framework and this time from Salesforce where they proposed ‘BLIP’: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
    Just how humans use vision and language to experience the environment, AI models are built on the foundations of vision and language. Vision-language pre-training has been widely adopted to enable AI agents to understand the world and communicate with humans. This approach involves training a model on image-text data to teach how to grasp both visual and written information. The model is pretrained before fine-tuning it. Skipping this step would reduce the model’s performance because the model then must be trained from the beginning on each subsequent task. You can continue reading my summary on this research or you can also read the paper and check the codes here https://i.redd.it/23klq6w9exk81.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Do you know any AI that is better than this? Rock, Paper, Scissors
    The way this AI tries to predict my move is simple. So it is pretty easy to pick a move that at least get a draw. At some point I almost thought that the AI pattern may be just hardcoded. https://preview.redd.it/dcef8xoifwk81.png?width=1920&format=png&auto=webp&s=cb7f6b2b90c49dd895e00bbdee0cc8999929fc17 submitted by /u/siubb [link] [comments]
    Google AI Making Progress in Federated Learning with Formal Differential Privacy Guarantees
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Snail-mimetic robot - SnailBOT
    submitted by /u/nousetest [link] [comments]
  • Open

    Beyond Be-leaf: Immersive 3D Experience Transports Audiences to Natural Worlds With Augmented Reality
    Imagine walking through the bustling streets of London’s Piccadilly Circus, when suddenly you’re in a tropical rainforest, surrounded by vibrant flowers and dancing butterflies. That’s what audiences will see in the virtual world of The Green Planet AR Experience, an interactive, augmented reality experience that blends physical and digital worlds to connect people with nature. Read article > The post Beyond Be-leaf: Immersive 3D Experience Transports Audiences to Natural Worlds With Augmented Reality appeared first on NVIDIA Blog.  ( 4 min )
    Podsplainer: What’s a Recommender System? NVIDIA’s Even Oldridge Breaks It Down
    The very thing that makes the internet so useful to so many people — the vast quantity of information that’s out there — can also make going online frustrating. There’s so much available that the sheer volume of choices can be overwhelming. That’s where recommender systems come in, explains NVIDIA AI Podcast host Noah Kravitz. Read article > The post Podsplainer: What’s a Recommender System? NVIDIA’s Even Oldridge Breaks It Down appeared first on NVIDIA Blog.  ( 8 min )
  • Open

    Using Deep Learning to Annotate the Protein Universe
    Posted by Maxwell Bileschi, Staff Software Engineer and Lucy Colwell, Research Scientist, Google Research, Brain Team Proteins are essential molecules found in all living things. They play a central role in our bodies’ structure and function, and they are also featured in many products that we encounter every day, from medications to household items like laundry detergent. Each protein is a chain of amino acid building blocks, and just as an image may include multiple objects, like a dog and a cat, a protein may also have multiple components, which are called protein domains. Understanding the relationship between a protein’s amino acid sequence — for example, its domains — and its structure or function are long-standing challenges with far-reaching scientific implications. An examp…  ( 8 min )
  • Open

    A new resource for teaching responsible technology development
    The Social and Ethical Responsibilities of Computing publishes a collection of original pedagogical materials developed for instructional use on MIT OpenCourseWare.  ( 5 min )
    The benefits of peripheral vision for machines
    Researchers find similarities between how some computer-vision systems process images and how humans see out of the corners of our eyes.  ( 8 min )
  • Open

    A Gentle Introduction to Unit Testing in Python
    Unit testing is a method for testing software which looks at the smallest testable pieces of code, called units, are […] The post A Gentle Introduction to Unit Testing in Python appeared first on Machine Learning Mastery.  ( 15 min )
  • Open

    Researchers at UC Berkeley Introduce a New Competence-Based Algorithm Called Contrastive Intrinsic Control (CIC) For Unsupervised Skill Discovery
    In the presence of extrinsic rewards, Deep Reinforcement Learning (RL) is a strong strategy for tackling complex control tasks. Playing video games with pixels, mastering the game of Go, robotic mobility, and dexterous manipulation policies are all examples of successful applications. While effective, the above advancements resulted in agents that were unable to generalize to new downstream tasks other than the one for which they were trained. Humans and animals, on the other hand, can learn skills and apply them to a range of downstream activities with little supervision. In a recent paper, UC Berkeley researchers aim to teach agents with generalization capabilities by efficiently adapting their skills to downstream tasks. Continue reading my summary on this paper Paper | Github https://preview.redd.it/jeomp8g2d0l81.png?width=1370&format=png&auto=webp&s=24701cfc995eeffcace5a62fb734dce643499487 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Double Deep Q
    Hello, I started learning RL a month ago, but I can't really comprehend why we use Double Deep Q learning ! I understand that the problem with queue learning is that we uptade estimation using estimations dynamically, and those estimation will likely be overestimations and that will cause a problem. But the two classic ways of double Q learning : Uptading the target each n steps or slowling uptading the target to be closer to the the policy don't seem to resolve the problem for me theorically .. All explainations will be appreciated ! submitted by /u/MichelMED10 [link] [comments]  ( 1 min )
    Action space [-1, 1] summing up to 1
    Hey, Im trying to train some agents for a problem where the action space is supposed to be in a range from [-1, 1] with all actions summing up to 1. Bringing the actions in a range from -1 to 1 is easy but I'm struggling with the idea of making them sum up to one over all dimensions. E.g. action space with 3 dimensions: [0.5, -0.25, 0.75] Do you guys have any idea how to solve that problem? submitted by /u/sedidrl [link] [comments]  ( 2 min )
    decentralized vs. centralized training in MARL
    There are many CTDE methods like QMIX and MAPPO and some fully decentralized methods like IPPO,IQL. What are the Pros and cons of CTDE and fully decentralized training? And if one method uses parameter sharing, could it be regarded as fully decentralized? submitted by /u/Traditional-Ad4492 [link] [comments]  ( 1 min )
    [R] PolyCoder 2.7BN LLM - open source model and parameters {CMU}
    submitted by /u/yazriel0 [link] [comments]  ( 1 min )
    Reinforcement learning to generate words?
    Hi all, I am trying to train a bot to play the wordle game. This game will require the agent to input a word with 5 characters. However, I am not sure how shall I format this problem. The action is a 5-character word. It seems not possible to set it up with a 26 ^ 5 action space for a DQN. Shall I treat it as 5 separated actions in a 26 discrete action set? Or is there a better way to format it? I wonder if there is any similar publication, which tries to generate words as actions, that I could use as a reference? Any comment will be welcomed! Thanks! submitted by /u/jadore801120 [link] [comments]  ( 1 min )
  • Open

    Any resources about transformer architectures?
    Hey everyone, I want to get myself more involved with transformers. It can be a blog, a YouTube video, a lecture... do you guys have any recommendations that are not as lengthy as a whole semester worth of information? One can (and probably should) argue that before transformers, attention should be the first target to comprehend, and I am open to that also. submitted by /u/ege6211 [link] [comments]  ( 1 min )
  • Open

    Use of Artificial Intelligence in Stroke Detection
    What is Stroke ?  ( 5 min )
  • Open

    Bundesliga Match Fact Set Piece Threat: Evaluating team performance in set pieces on AWS
    The importance of set pieces in football (or soccer in the US) has been on the rise in recent years: now more than one quarter of all goals are scored via set pieces. Free kicks and corners generally create the most promising situations, and some professional teams have even hired specific coaches for those parts […]  ( 10 min )
    Bundesliga Match Fact Skill: Quantifying football player qualities using machine learning on AWS
    In football, as in many sports, discussions about individual players have always been part of the fun. “Who is the best scorer?” or “Who is the king of defenders?” are questions perennially debated by fans, and social media amplifies this debate. Just consider that Erling Haaland, Robert Lewandowski, and Thomas Müller alone have a combined […]  ( 10 min )
  • Open

    Strong Optimal Classification Trees. (arXiv:2103.15965v2 [cs.LG] UPDATED)
    Decision trees are among the most popular machine learning (ML) models and are used routinely in applications ranging from revenue management and medicine to bioinformatics. In this paper, we consider the problem of learning optimal binary classification trees. Literature on the topic has burgeoned in recent years, motivated both by the empirical suboptimality of heuristic approaches and the tremendous improvements in mixed-integer optimization (MIO) technology. Yet, existing MIO-based approaches from the literature do not leverage the power of MIO to its full extent: they rely on weak formulations, resulting in slow convergence and large optimality gaps. To fill this gap in the literature, we propose an intuitive flow-based MIO formulation for learning optimal binary classification trees. Our formulation can accommodate side constraints to enable the design of interpretable and fair decision trees. Moreover, we show that our formulation has a stronger linear optimization relaxation than existing methods. We exploit the decomposable structure of our formulation and max-flow/min-cut duality to derive a Benders' decomposition method to speed-up computation. We propose a tailored procedure for solving each decomposed subproblem that provably generates facets of the feasible set of the MIO as constraints to add to the main problem. We conduct extensive computational experiments on standard benchmark datasets on which we show that our proposed approaches are 31 times faster than state-of-the-art MIO-based techniques and improve out-of-sample performance by up to 8%.  ( 2 min )
    BioADAPT-MRC: Adversarial Learning-based Domain Adaptation Improves Biomedical Machine Reading Comprehension Task. (arXiv:2202.13174v1 [cs.CL])
    Motivation: Biomedical machine reading comprehension (biomedical-MRC) aims to comprehend complex biomedical narratives and assist healthcare professionals in retrieving information from them. The high performance of modern neural network-based MRC systems depends on high-quality, large-scale, human-annotated training datasets. In the biomedical domain, a crucial challenge in creating such datasets is the requirement for domain knowledge, inducing the scarcity of labeled data and the need for transfer learning from the labeled general-purpose (source) domain to the biomedical (target) domain. However, there is a discrepancy in marginal distributions between the general-purpose and biomedical domains due to the variances in topics. Therefore, direct-transferring of learned representations from a model trained on a general-purpose domain to the biomedical domain can hurt the model's performance. Results: We present an adversarial learning-based domain adaptation framework for the biomedical machine reading comprehension task (BioADAPT-MRC), a neural network-based method to address the discrepancies in the marginal distributions between the general and biomedical domain datasets. BioADAPT-MRC relaxes the need for generating pseudo labels for training a well-performing biomedical-MRC model. We extensively evaluate the performance of BioADAPT-MRC by comparing it with the best existing methods on three widely used benchmark biomedical-MRC datasets -- BioASQ-7b, BioASQ-8b, and BioASQ-9b. Our results suggest that without using any synthetic or human-annotated data from the biomedical domain, BioADAPT-MRC can achieve state-of-the-art performance on these datasets. Availability: BioADAPT-MRC is freely available as an open-source project at\\https://github.com/mmahbub/BioADAPT-MRC  ( 2 min )
    Multi-Agent Reinforcement Learning for Markov Routing Games: A New Modeling Paradigm For Dynamic Traffic Assignment. (arXiv:2011.10915v2 [cs.LG] UPDATED)
    This paper aims to develop a paradigm that models the learning behavior of intelligent agents (including but not limited to autonomous vehicles, connected and automated vehicles, or human-driven vehicles with intelligent navigation systems where human drivers follow the navigation instructions completely) with a utility-optimizing goal and the system's equilibrating processes in a routing game among atomic selfish agents. Such a paradigm can assist policymakers in devising optimal operational and planning countermeasures under both normal and abnormal circumstances. To this end, we develop a Markov routing game (MRG) in which each agent learns and updates her own en-route path choice policy while interacting with others in transportation networks. To efficiently solve MRG, we formulate it as multi-agent reinforcement learning (MARL) and devise a mean field multi-agent deep Q learning (MF-MA-DQL) approach that captures the competition among agents. The linkage between the classical DUE paradigm and our proposed Markov routing game (MRG) is discussed. We show that the routing behavior of intelligent agents is shown to converge to the classical notion of predictive dynamic user equilibrium (DUE) when traffic environments are simulated using dynamic loading models (DNL). In other words, the MRG depicts DUEs assuming perfect information and deterministic environments propagated by DNL models. Four examples are solved to illustrate the algorithm efficiency and consistency between DUE and the MRG equilibrium, on a simple network without and with spillback, the Ortuzar Willumsen (OW) Network, and a real-world network near Columbia University's campus in Manhattan of New York City.  ( 3 min )
    Do We Need Anisotropic Graph Neural Networks?. (arXiv:2104.01481v3 [cs.LG] UPDATED)
    Common wisdom in the graph neural network (GNN) community dictates that anisotropic models -- in which messages sent between nodes are a function of both the source and target node -- are required to achieve state-of-the-art performance. Benchmarks to date have demonstrated that these models perform better than comparable isotropic models -- where messages are a function of the source node only. In this work we provide empirical evidence challenging this narrative: we propose an isotropic GNN, which we call Efficient Graph Convolution (EGC), that consistently outperforms comparable anisotropic models, including the popular GAT or PNA architectures by using spatially-varying adaptive filters. In addition to raising important questions for the GNN community, our work has significant real-world implications for efficiency. EGC achieves higher model accuracy, with lower memory consumption and latency, along with characteristics suited to accelerator implementation, while being a drop-in replacement for existing architectures. As an isotropic model, it requires memory proportional to the number of vertices in the graph ($\mathcal{O}(V)$); in contrast, anisotropic models require memory proportional to the number of edges ($\mathcal{O}(E)$). We demonstrate that EGC outperforms existing approaches across 6 large and diverse benchmark datasets, and conclude by discussing questions that our work raise for the community going forward. Code and pretrained models for our experiments are provided at https://github.com/shyam196/egc.  ( 2 min )
    How Do Vision Transformers Work?. (arXiv:2202.06709v2 [cs.CV] UPDATED)
    The success of multi-head self-attentions (MSAs) for computer vision is now indisputable. However, little is known about how MSAs work. We present fundamental explanations to help better understand the nature of MSAs. In particular, we demonstrate the following properties of MSAs and Vision Transformers (ViTs): (1) MSAs improve not only accuracy but also generalization by flattening the loss landscapes. Such improvement is primarily attributable to their data specificity, not long-range dependency. On the other hand, ViTs suffer from non-convex losses. Large datasets and loss landscape smoothing methods alleviate this problem; (2) MSAs and Convs exhibit opposite behaviors. For example, MSAs are low-pass filters, but Convs are high-pass filters. Therefore, MSAs and Convs are complementary; (3) Multi-stage neural networks behave like a series connection of small individual models. In addition, MSAs at the end of a stage play a key role in prediction. Based on these insights, we propose AlterNet, a model in which Conv blocks at the end of a stage are replaced with MSA blocks. AlterNet outperforms CNNs not only in large data regimes but also in small data regimes. The code is available at https://github.com/xxxnell/how-do-vits-work.  ( 2 min )
    Initialization of Latent Space Coordinates via Random Linear Projections for Learning Robotic Sensory-Motor Sequences. (arXiv:2202.13057v1 [cs.LG])
    Robot kinematics data, despite being a high dimensional process, is highly correlated, especially when considering motions grouped in certain primitives. These almost linear correlations within primitives allow us to interpret the motions as points drawn close to a union of low-dimensional linear subspaces in the space of all motions. Motivated by results of embedding theory, in particular, generalizations of Whitney embedding theorem, we show that random linear projection of motor sequences into low dimensional space loses very little information about structure of kinematics data. Projected points are very good initial guess for values of latent variables in generative model for robot sensory-motor behaviour primitives. We conducted series of experiments where we trained a recurrent neural network to generate sensory-motor sequences for robotic manipulator with 9 degrees of freedom. Experimental results demonstrate substantial improvement in generalisation abilities for unobserved samples in the case of initialization of latent variables with random linear projection of motor data over initialization with zero or random values. Moreover, latent space is well-structured wherein samples belonging to different primitives are well separated from the onset of training process.  ( 2 min )
    Distribution Preserving Graph Representation Learning. (arXiv:2202.13428v1 [cs.LG])
    Graph neural network (GNN) is effective to model graphs for distributed representations of nodes and an entire graph. Recently, research on the expressive power of GNN attracted growing attention. A highly-expressive GNN has the ability to generate discriminative graph representations. However, in the end-to-end training process for a certain graph learning task, a highly-expressive GNN risks generating graph representations overfitting the training data for the target task, while losing information important for the model generalization. In this paper, we propose Distribution Preserving GNN (DP-GNN) - a GNN framework that can improve the generalizability of expressive GNN models by preserving several kinds of distribution information in graph representations and node representations. Besides the generalizability, by applying an expressive GNN backbone, DP-GNN can also have high expressive power. We evaluate the proposed DP-GNN framework on multiple benchmark datasets for graph classification tasks. The experimental results demonstrate that our model achieves state-of-the-art performances.  ( 2 min )
    Theoretical bounds on data requirements for the ray-based classification. (arXiv:2103.09577v3 [cs.LG] UPDATED)
    The problem of classifying high-dimensional shapes in real-world data grows in complexity as the dimension of the space increases. For the case of identifying convex shapes of different geometries, a new classification framework has recently been proposed in which the intersections of a set of one-dimensional representations, called rays, with the boundaries of the shape are used to identify the specific geometry. This ray-based classification (RBC) has been empirically verified using a synthetic dataset of two- and three-dimensional shapes (Zwolak et al. in Proceedings of Third Workshop on Machine Learning and the Physical Sciences (NeurIPS 2020), Vancouver, Canada [December 11, 2020], arXiv:2010.00500, 2020) and, more recently, has also been validated experimentally (Zwolak et al., PRX Quantum 2:020335, 2021). Here, we establish a bound on the number of rays necessary for shape classification, defined by key angular metrics, for arbitrary convex shapes. For two dimensions, we derive a lower bound on the number of rays in terms of the shape's length, diameter, and exterior angles. For convex polytopes in $\mathbb{R}^N$, we generalize this result to a similar bound given as a function of the dihedral angle and the geometrical parameters of polygonal faces. This result enables a different approach for estimating high-dimensional shapes using substantially fewer data elements than volumetric or surface-based approaches.  ( 2 min )
    S4-Crowd: Semi-Supervised Learning with Self-Supervised Regularisation for Crowd Counting. (arXiv:2108.13969v2 [cs.CV] UPDATED)
    Crowd counting has drawn more attention because of its wide application in smart cities. Recent works achieved promising performance but relied on the supervised paradigm with expensive crowd annotations. To alleviate annotation cost, in this work we proposed a semi-supervised learning framework S4-Crowd, which can leverage both unlabeled/labeled data for robust crowd modelling. In the unsupervised pathway, two self-supervised losses were proposed to simulate the crowd variations such as scale, illumination, etc., based on which and the supervised information pseudo labels were generated and gradually refined. We also proposed a crowd-driven recurrent unit Gated-Crowd-Recurrent-Unit (GCRU), which can preserve discriminant crowd information by extracting second-order statistics, yielding pseudo labels with improved quality. A joint loss including both unsupervised/supervised information was proposed, and a dynamic weighting strategy was employed to balance the importance of the unsupervised loss and supervised loss at different training stages. We conducted extensive experiments on four popular crowd counting datasets in semi-supervised settings. Experimental results suggested the effectiveness of each proposed component in our S4-Crowd framework. Our method also outperformed other state-of-the-art semi-supervised learning approaches on these crowd datasets.  ( 2 min )
    Optimal-er Auctions through Attention. (arXiv:2202.13110v1 [cs.LG])
    RegretNet is a recent breakthrough in the automated design of revenue-maximizing auctions. It combines the expressivity of deep learning with the regret-based approach to relax and quantify the Incentive Compatibility constraint (that participants benefit from bidding truthfully). As a follow-up to its success, we propose two independent modifications of RegretNet, namely a new neural architecture based on the attention mechanism, denoted as TransRegret, and an alternative loss function that is interpretable and significantly less sensitive to hyperparameters. We investigate both proposed modifications in an extensive experimental study in settings with fixed and varied input sizes and additionally test out-of-setting generalization of our network. In all experiments, we find that TransRegret consistently outperforms existing architectures in revenue. Regarding our loss modification, we confirm its effectiveness at controlling the revenue-regret trade-off by varying a single interpretable hyperparameter.  ( 2 min )
    Fair-SSL: Building fair ML Software with less data. (arXiv:2111.02038v4 [cs.SE] UPDATED)
    Ethical bias in machine learning models has become a matter of concern in the software engineering community. Most of the prior software engineering works concentrated on finding ethical bias in models rather than fixing it. After finding bias, the next step is mitigation. Prior researchers mainly tried to use supervised approaches to achieve fairness. However, in the real world, getting data with trustworthy ground truth is challenging and also ground truth can contain human bias. Semi-supervised learning is a machine learning technique where, incrementally, labeled data is used to generate pseudo-labels for the rest of the data (and then all that data is used for model training). In this work, we apply four popular semi-supervised techniques as pseudo-labelers to create fair classification models. Our framework, Fair-SSL, takes a very small amount (10%) of labeled data as input and generates pseudo-labels for the unlabeled data. We then synthetically generate new data points to balance the training data based on class and protected attribute as proposed by Chakraborty et al. in FSE 2021. Finally, the classification model is trained on the balanced pseudo-labeled data and validated on test data. After experimenting on ten datasets and three learners, we find that Fair-SSL achieves similar performance as three state-of-the-art bias mitigation algorithms. That said, the clear advantage of Fair-SSL is that it requires only 10% of the labeled training data. To the best of our knowledge, this is the first SE work where semi-supervised techniques are used to fight against ethical bias in SE ML models.  ( 2 min )
    Graph Attention Retrospective. (arXiv:2202.13060v1 [cs.LG])
    Graph-based learning is a rapidly growing sub-field of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular type of models is graph attention networks. These models were introduced to allow a node to aggregate information from the features of neighbor nodes in a non-uniform way in contrast to simple graph convolution which does not distinguish the neighbors of a node. In this paper, we study theoretically this expected behaviour of graph attention networks. We prove multiple results on the performance of the graph attention mechanism for the problem of node classification for a contextual stochastic block model. Here the features of the nodes are obtained from a mixture of Gaussians and the edges from a stochastic block model where the features and the edges are coupled in a natural way. First, we show that in an "easy" regime, where the distance between the means of the Gaussians is large enough, graph attention maintains the weights of intra-class edges and significantly reduces the weights of the inter-class edges. As a corollary, we show that this implies perfect node classification independent of the weights of inter-class edges. However, a classical argument shows that in the "easy" regime, the graph is not needed at all to classify the data with high probability. In the "hard" regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. We evaluate our theoretical results on synthetic and real-world data.  ( 2 min )
    Relational Surrogate Loss Learning. (arXiv:2202.13197v1 [cs.LG])
    Evaluation metrics in machine learning are often hardly taken as loss functions, as they could be non-differentiable and non-decomposable, e.g., average precision and F1 score. This paper aims to address this problem by revisiting the surrogate loss learning, where a deep neural network is employed to approximate the evaluation metrics. Instead of pursuing an exact recovery of the evaluation metric through a deep neural network, we are reminded of the purpose of the existence of these evaluation metrics, which is to distinguish whether one model is better or worse than another. In this paper, we show that directly maintaining the relation of models between surrogate losses and metrics suffices, and propose a rank correlation-based optimization method to maximize this relation and learn surrogate losses. Compared to previous works, our method is much easier to optimize and enjoys significant efficiency and performance gains. Extensive experiments show that our method achieves improvements on various tasks including image classification and neural machine translation, and even outperforms state-of-the-art methods on human pose estimation and machine reading comprehension tasks. Code is available at: https://github.com/hunto/ReLoss.  ( 2 min )
    MIMOSA: Multi-constraint Molecule Sampling for Molecule Optimization. (arXiv:2010.02318v3 [cs.LG] UPDATED)
    Molecule optimization is a fundamental task for accelerating drug discovery, with the goal of generating new valid molecules that maximize multiple drug properties while maintaining similarity to the input molecule. Existing generative models and reinforcement learning approaches made initial success, but still face difficulties in simultaneously optimizing multiple drug properties. To address such challenges, we propose the MultI-constraint MOlecule SAmpling (MIMOSA) approach, a sampling framework to use input molecule as an initial guess and sample molecules from the target distribution. MIMOSA first pretrains two property agnostic graph neural networks (GNNs) for molecule topology and substructure-type prediction, where a substructure can be either atom or single ring. For each iteration, MIMOSA uses the GNNs' prediction and employs three basic substructure operations (add, replace, delete) to generate new molecules and associated weights. The weights can encode multiple constraints including similarity and drug property constraints, upon which we select promising molecules for next iteration. MIMOSA enables flexible encoding of multiple property- and similarity-constraints and can efficiently generate new molecules that satisfy various property constraints and achieved up to 49.6% relative improvement over the best baseline in terms of success rate. The code repository (including readme file, data preprocessing and model construction, evaluation) is available https://github.com/futianfan/MIMOSA.  ( 2 min )
    Interpretable Concept-based Prototypical Networks for Few-Shot Learning. (arXiv:2202.13474v1 [cs.LG])
    Few-shot learning aims at recognizing new instances from classes with limited samples. This challenging task is usually alleviated by performing meta-learning on similar tasks. However, the resulting models are black-boxes. There has been growing concerns about deploying black-box machine learning models and FSL is not an exception in this regard. In this paper, we propose a method for FSL based on a set of human-interpretable concepts. It constructs a set of metric spaces associated with the concepts and classifies samples of novel classes by aggregating concept-specific decisions. The proposed method does not require concept annotations for query samples. This interpretable method achieved results on a par with six previously state-of-the-art black-box FSL methods on the CUB fine-grained bird classification dataset.  ( 2 min )
    Recent Progress in the CUHK Dysarthric Speech Recognition System. (arXiv:2201.05845v2 [eess.AS] UPDATED)
    Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date. Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based ASR technologies that predominantly target normal speech. This paper presents recent research efforts at the Chinese University of Hong Kong (CUHK) to improve the performance of disordered speech recognition systems on the largest publicly available UASpeech dysarthric speech corpus. A set of novel modelling techniques including neural architectural search, data augmentation using spectra-temporal perturbation, model based speaker adaptation and cross-domain generation of visual features within an audio-visual speech recognition (AVSR) system framework were employed to address the above challenges. The combination of these techniques produced the lowest published word error rate (WER) of 25.21% on the UASpeech test set 16 dysarthric speakers, and an overall WER reduction of 5.4% absolute (17.6% relative) over the CUHK 2018 dysarthric speech recognition system featuring a 6-way DNN system combination and cross adaptation of out-of-domain normal speech data trained systems. Bayesian model adaptation further allows rapid adaptation to individual dysarthric speakers to be performed using as little as 3.06 seconds of speech. The efficacy of these techniques were further demonstrated on a CUDYS Cantonese dysarthric speech recognition task.  ( 2 min )
    Combating the Instability of Mutual Information-based Losses via Regularization. (arXiv:2011.07932v2 [cs.LG] UPDATED)
    Notable progress has been made in numerous fields of machine learning based on neural network-driven mutual information (MI) bounds. However, utilizing the conventional MI-based losses is often challenging due to their practical and mathematical limitations. In this work, we first identify the symptoms behind their instability: (1) the neural network not converging even after the loss seemed to converge, and (2) saturating neural network outputs causing the loss to diverge. We mitigate both issues by adding a novel regularization term to the existing losses. We theoretically and experimentally demonstrate that added regularization stabilizes training. Finally, we present a novel benchmark that evaluates MI-based losses on both the MI estimation power and its capability on the downstream tasks, closely following the pre-existing supervised and contrastive learning settings. We evaluate six different MI-based losses and their regularized counterparts on multiple benchmarks to show that our approach is simple yet effective.  ( 2 min )
    Probabilistic Inference of Simulation Parameters via Parallel Differentiable Simulation. (arXiv:2109.08815v2 [cs.RO] UPDATED)
    To accurately reproduce measurements from the real world, simulators need to have an adequate model of the physical system and require the parameters of the model be identified. We address the latter problem of estimating parameters through a Bayesian inference approach that approximates a posterior distribution over simulation parameters given real sensor measurements. By extending the commonly used Gaussian likelihood model for trajectories via the multiple-shooting formulation, our chosen particle-based inference algorithm Stein Variational Gradient Descent is able to identify highly nonlinear, underactuated systems. We leverage GPU code generation and differentiable simulation to evaluate the likelihood and its gradient for many particles in parallel. Our algorithm infers non-parametric distributions over simulation parameters more accurately than comparable baselines and handles constraints over parameters efficiently through gradient-based optimization. We evaluate estimation performance on several physical experiments. On an underactuated mechanism where a 7-DOF robot arm excites an object with an unknown mass configuration, we demonstrate how our inference technique can identify symmetries between the parameters and provide highly accurate predictions. Project website: https://uscresl.github.io/prob-diff-sim  ( 2 min )
    A Multimodal German Dataset for Automatic Lip Reading Systems and Transfer Learning. (arXiv:2202.13403v1 [cs.CV])
    Large datasets as required for deep learning of lip reading do not exist in many languages. In this paper we present the dataset GLips (German Lips) consisting of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which was processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English language LRW (Lip Reading in the Wild) dataset, with each video encoding one word of interest in a context of 1.16 seconds duration, which yields compatibility for studying transfer learning between both datasets. By training a deep neural network, we investigate whether lip reading has language-independent features, so that datasets of different languages can be used to improve lip reading models. We demonstrate learning from scratch and show that transfer learning from LRW to GLips and vice versa improves learning speed and performance, in particular for the validation set.  ( 2 min )
    Robust and Accurate Authorship Attribution via Program Normalization. (arXiv:2007.00772v3 [cs.LG] UPDATED)
    Source code attribution approaches have achieved remarkable accuracy thanks to the rapid advances in deep learning. However, recent studies shed light on their vulnerability to adversarial attacks. In particular, they can be easily deceived by adversaries who attempt to either create a forgery of another author or to mask the original author. To address these emerging issues, we formulate this security challenge into a general threat model, the $\textit{relational adversary}$, that allows an arbitrary number of the semantics-preserving transformations to be applied to an input in any problem space. Our theoretical investigation shows the conditions for robustness and the trade-off between robustness and accuracy in depth. Motivated by these insights, we present a novel learning framework, $\textit{normalize-and-predict}$ ($\textit{N&P}$), that in theory guarantees the robustness of any authorship-attribution approach. We conduct an extensive evaluation of $\textit{N&P}$ in defending two of the latest authorship-attribution approaches against state-of-the-art attack methods. Our evaluation demonstrates that $\textit{N&P}$ improves the accuracy on adversarial inputs by as much as 70% over the vanilla models. More importantly, $\textit{N&P}$ also increases robust accuracy to 45% higher than adversarial training while running over 40 times faster.  ( 2 min )
    Missing Value Knockoffs. (arXiv:2202.13054v1 [stat.ML])
    One limitation of the most statistical/machine learning-based variable selection approaches is their inability to control the false selections. A recently introduced framework, model-x knockoffs, provides that to a wide range of models but lacks support for datasets with missing values. In this work, we discuss ways of preserving the theoretical guarantees of the model-x framework in the missing data setting. First, we prove that posterior sampled imputation allows reusing existing knockoff samplers in the presence of missing values. Second, we show that sampling knockoffs only for the observed variables and applying univariate imputation also preserves the false selection guarantees. Third, for the special case of latent variable models, we demonstrate how jointly imputing and sampling knockoffs can reduce the computational complexity. We have verified the theoretical findings with two different exploratory variable distributions and investigated how the missing data pattern, amount of correlation, the number of observations, and missing values affected the statistical power.  ( 2 min )
    Hierarchical Gaussian Process Models for Regression Discontinuity/Kink under Sharp and Fuzzy Designs. (arXiv:2110.00921v2 [stat.ML] UPDATED)
    We propose nonparametric Bayesian estimators for causal inference exploiting Regression Discontinuity/Kink (RD/RK) under sharp and fuzzy designs. Our estimators are based on Gaussian Process (GP) regression and classification. The GP methods are powerful probabilistic machine learning approaches that are advantageous in terms of derivative estimation and uncertainty quantification, facilitating RK estimation and inference of RD/RK models. These estimators are extended to hierarchical GP models with an intermediate Bayesian neural network layer and can be characterized as hybrid deep learning models. Monte Carlo simulations show that our estimators perform comparably to and sometimes better than competing estimators in terms of precision, coverage and interval length. The hierarchical GP models considerably improve upon one-layer GP models. We apply the proposed methods to estimate the incumbency advantage of US house elections. Our estimations suggest a significant incumbency advantage in terms of both vote share and probability of winning in the next elections. Lastly we present an extension to accommodate covariate adjustment.  ( 2 min )
    Generalized Equivariance and Preferential Labeling for GNN Node Classification. (arXiv:2102.11485v3 [cs.LG] UPDATED)
    Existing graph neural networks (GNNs) largely rely on node embeddings, which represent a node as a vector by its identity, type, or content. However, graphs with unattributed nodes widely exist in real-world applications (e.g., anonymized social networks). Previous GNNs either assign random labels to nodes (which introduces artefacts to the GNN) or assign one embedding to all nodes (which fails to explicitly distinguish one node from another). Further, when these GNNs are applied to unattributed node classification problems, they have an undesired equivariance property, which are fundamentally unable to address the data with multiple possible outputs. In this paper, we analyze the limitation of existing approaches to node classification problems. Inspired by our analysis, we propose a generalized equivariance property and a Preferential Labeling technique that satisfies the desired property asymptotically. Experimental results show that we achieve high performance in several unattributed node classification tasks.  ( 2 min )
    Projective Ranking-based GNN Evasion Attacks. (arXiv:2202.12993v1 [cs.LG])
    Graph neural networks (GNNs) offer promising learning methods for graph-related tasks. However, GNNs are at risk of adversarial attacks. Two primary limitations of the current evasion attack methods are highlighted: (1) The current GradArgmax ignores the "long-term" benefit of the perturbation. It is faced with zero-gradient and invalid benefit estimates in certain situations. (2) In the reinforcement learning-based attack methods, the learned attack strategies might not be transferable when the attack budget changes. To this end, we first formulate the perturbation space and propose an evaluation framework and the projective ranking method. We aim to learn a powerful attack strategy then adapt it as little as possible to generate adversarial samples under dynamic budget settings. In our method, based on mutual information, we rank and assess the attack benefits of each perturbation for an effective attack strategy. By projecting the strategy, our method dramatically minimizes the cost of learning a new attack strategy when the attack budget changes. In the comparative assessment with GradArgmax and RL-S2V, the results show our method owns high attack performance and effective transferability. The visualization of our method also reveals various attack patterns in the generation of adversarial samples.  ( 2 min )
    Typing assumptions improve identification in causal discovery. (arXiv:2107.10703v2 [cs.LG] UPDATED)
    Causal discovery from observational data is a challenging task that can only be solved up to a set of equivalent solutions, called an equivalence class. Such classes, which are often large in size, encode uncertainties about the orientation of some edges in the causal graph. In this work, we propose a new set of assumptions that constrain possible causal relationships based on the nature of variables, thus circumscribing the equivalence class. Namely, we introduce typed directed acyclic graphs, in which variable types are used to determine the validity of causal relationships. We demonstrate, both theoretically and empirically, that the proposed assumptions can result in significant gains in the identification of the causal graph. We also propose causal discovery algorithms that make use of these assumptions and demonstrate their benefits on simulated and pseudo-real data.  ( 2 min )
    SimSR: Simple Distance-based State Representation for Deep Reinforcement Learning. (arXiv:2112.15303v2 [cs.LG] UPDATED)
    This work explores how to learn robust and generalizable state representation from image-based observations with deep reinforcement learning methods. Addressing the computational complexity, stringent assumptions and representation collapse challenges in existing work of bisimulation metric, we devise Simple State Representation (SimSR) operator. SimSR enables us to design a stochastic approximation method that can practically learn the mapping functions (encoders) from observations to latent representation space. In addition to the theoretical analysis and comparison with the existing work, we experimented and compared our work with recent state-of-the-art solutions in visual MuJoCo tasks. The results shows that our model generally achieves better performance and has better robustness and good generalization.  ( 2 min )
    Hierarchical Linear Dynamical System for Representing Notes from Recorded Audio. (arXiv:2202.13255v1 [cs.SD])
    We seek to develop simultaneous segmentation and classification of notes from audio recordings in presence of outliers. The selected architecture for modeling time series is hierarchical linear dynamical system (HLDS). We propose a novel method for its parameter setting. HLDS can potentially be employed in two ways: 1) simultaneous segmentation and clustering for exploring data, i.e. finding unknown notes, 2) simultaneous segmentation and classification of audio recording for finding the notes of interest in the presence of outliers. We adapted HLDS for the second purpose since it is an easier task and still a challenging problem, e.g. in the field of bioacoustics. Each test clip has the same notes (but different instances) as of the training clip and also contain outlier notes. At test, it is automatically decided to which class of interest a note belongs to if any. Two applications of this work are to the fields of bioacoustics for detection of animal sounds in audio field recordings and also to musicology. Experiments have been conducted for segmentation and classification of both avian and musical notes from recorded audio.  ( 2 min )
    Self-Supervised and Interpretable Anomaly Detection using Network Transformers. (arXiv:2202.12997v1 [cs.LG])
    Monitoring traffic in computer networks is one of the core approaches for defending critical infrastructure against cyber attacks. Machine Learning (ML) and Deep Neural Networks (DNNs) have been proposed in the past as a tool to identify anomalies in computer networks. Although detecting these anomalies provides an indication of an attack, just detecting an anomaly is not enough information for a user to understand the anomaly. The black-box nature of off-the-shelf ML models prevents extracting important information that is fundamental to isolate the source of the fault/attack and take corrective measures. In this paper, we introduce the Network Transformer (NeT), a DNN model for anomaly detection that incorporates the graph structure of the communication network in order to improve interpretability. The presented approach has the following advantages: 1) enhanced interpretability by incorporating the graph structure of computer networks; 2) provides a hierarchical set of features that enables analysis at different levels of granularity; 3) self-supervised training that does not require labeled data. The presented approach was tested by evaluating the successful detection of anomalies in an Industrial Control System (ICS). The presented approach successfully identified anomalies, the devices affected, and the specific connections causing the anomalies, providing a data-driven hierarchical approach to analyze the behavior of a cyber network.
    Towards Return Parity in Markov Decision Processes. (arXiv:2111.10476v2 [cs.LG] UPDATED)
    Algorithmic decisions made by machine learning models in high-stakes domains may have lasting impacts over time. However, naive applications of standard fairness criterion in static settings over temporal domains may lead to delayed and adverse effects. To understand the dynamics of performance disparity, we study a fairness problem in Markov decision processes (MDPs). Specifically, we propose return parity, a fairness notion that requires MDPs from different demographic groups that share the same state and action spaces to achieve approximately the same expected time-discounted rewards. We first provide a decomposition theorem for return disparity, which decomposes the return disparity of any two MDPs sharing the same state and action spaces into the distance between group-wise reward functions, the discrepancy of group policies, and the discrepancy between state visitation distributions induced by the group policies. Motivated by our decomposition theorem, we propose algorithms to mitigate return disparity via learning a shared group policy with state visitation distributional alignment using integral probability metrics. We conduct experiments to corroborate our results, showing that the proposed algorithm can successfully close the disparity gap while maintaining the performance of policies on two real-world recommender system benchmark datasets.  ( 2 min )
    On-chip QNN: Towards Efficient On-Chip Training of Quantum Neural Networks. (arXiv:2202.13239v1 [quant-ph])
    Quantum Neural Network (QNN) is drawing increasing research interest thanks to its potential to achieve quantum advantage on near-term Noisy Intermediate Scale Quantum (NISQ) hardware. In order to achieve scalable QNN learning, the training process needs to be offloaded to real quantum machines instead of using exponential-cost classical simulators. One common approach to obtain QNN gradients is parameter shift whose cost scales linearly with the number of qubits. We present On-chip QNN, the first experimental demonstration of practical on-chip QNN training with parameter shift. Nevertheless, we find that due to the significant quantum errors (noises) on real machines, gradients obtained from naive parameter shift have low fidelity and thus degrade the training accuracy. To this end, we further propose probabilistic gradient pruning to firstly identify gradients with potentially large errors and then remove them. Specifically, small gradients have larger relative errors than large ones, thus having a higher probability to be pruned. We perform extensive experiments on 5 classification tasks with 5 real quantum machines. The results demonstrate that our on-chip training achieves over 90% and 60% accuracy for 2-class and 4-class image classification tasks. The probabilistic gradient pruning brings up to 7% QNN accuracy improvements over no pruning. Overall, we successfully obtain similar on-chip training accuracy compared with noise-free simulation but have much better training scalability. The code for parameter shift on-chip training is available in the TorchQuantum library.  ( 2 min )
    A Scalable Graph-Theoretic Distributed Framework for Cooperative Multi-Agent Reinforcement Learning. (arXiv:2202.13046v1 [cs.LG])
    The main challenge of large-scale cooperative multi-agent reinforcement learning (MARL) is two-fold: (i) the RL algorithm is desired to be distributed due to limited resource for each individual agent; (ii) issues on convergence or computational complexity emerge due to the curse of dimensionality. Unfortunately, most of existing distributed RL references only focus on ensuring that the individual policy-seeking process of each agent is based on local information, but fail to solve the scalability issue induced by high dimensions of the state and action spaces when facing large-scale networks. In this paper, we propose a general distributed framework for cooperative MARL by utilizing the structures of graphs involved in this problem. We introduce three graphs in MARL, namely, the coordination graph, the observation graph and the reward graph. Based on these three graphs, and a given communication graph, we propose two distributed RL approaches. The first approach utilizes the inherent decomposability property of the problem itself, whose efficiency depends on the structures of the aforementioned four graphs, and is able to produce a high performance under specific graphical conditions. The second approach provides an approximate solution and is applicable for any graphs. Here the approximation error depends on an artificially designed index. The choice of this index is a trade-off between minimizing the approximation error and reducing the computational complexity. Simulations show that our RL algorithms have a significantly improved scalability to large-scale MASs compared with centralized and consensus-based distributed RL algorithms.  ( 2 min )
    Benign Underfitting of Stochastic Gradient Descent. (arXiv:2202.13361v1 [cs.LG])
    We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, \emph{without}-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in \emph{any} sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related \emph{with}-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.  ( 2 min )
    Arrhythmia Classifier Using Convolutional Neural Network with Adaptive Loss-aware Multi-bit Networks Quantization. (arXiv:2202.12943v1 [eess.SP])
    Cardiovascular disease (CVDs) is one of the universal deadly diseases, and the detection of it in the early stage is a challenging task to tackle. Recently, deep learning and convolutional neural networks have been employed widely for the classification of objects. Moreover, it is promising that lots of networks can be deployed on wearable devices. An increasing number of methods can be used to realize ECG signal classification for the sake of arrhythmia detection. However, the existing neural networks proposed for arrhythmia detection are not hardware-friendly enough due to a remarkable quantity of parameters resulting in memory and power consumption. In this paper, we present a 1-D adaptive loss-aware quantization, achieving a high compression rate that reduces memory consumption by 23.36 times. In order to adapt to our compression method, we need a smaller and simpler network. We propose a 17 layer end-to-end neural network classifier to classify 17 different rhythm classes trained on the MIT-BIH dataset, realizing a classification accuracy of 93.5%, which is higher than most existing methods. Due to the adaptive bitwidth method making important layers get more attention and offered a chance to prune useless parameters, the proposed quantization method avoids accuracy degradation. It even improves the accuracy rate, which is 95.84%, 2.34% higher than before. Our study achieves a 1-D convolutional neural network with high performance and low resources consumption, which is hardware-friendly and illustrates the possibility of deployment on wearable devices to realize a real-time arrhythmia diagnosis.  ( 2 min )
    Online $k$-means Clustering on Arbitrary Data Streams. (arXiv:2102.09101v3 [cs.LG] UPDATED)
    We consider online $k$-means clustering where each new point is assigned to the nearest cluster center, after which the algorithm may update its centers. The loss incurred is the sum of squared distances from new points to their assigned cluster centers. The goal over a data stream $X$ is to achieve loss that is a constant factor of $L(X, OPT_k)$, the best possible loss using $k$ fixed points in hindsight. We propose a data parameter, $\Lambda(X)$, such that for any algorithm maintaining $O(k\text{poly}(\log n))$ centers at time $n$, there exists a data stream $X$ for which a loss of $\Omega(\Lambda(X))$ is inevitable. We then give a randomized algorithm that achieves clustering loss $O(\Lambda(X) + L(X, OPT_k))$. Our algorithm uses $O(k\text{poly}(\log n))$ memory and maintains $O(k\text{poly}(\log n))$ cluster centers. Our algorithm also enjoys a running time of $O(k\text{poly}(\log n))$ and is the first algorithm to achieve polynomial space and time complexity in this setting. It also is the first to have provable guarantees without making any assumptions on the input data.  ( 2 min )
    Sparsity-aware neural user behavior modeling in online interaction platforms. (arXiv:2202.13491v1 [cs.LG])
    Modern online platforms offer users an opportunity to participate in a variety of content-creation, social networking, and shopping activities. With the rapid proliferation of such online services, learning data-driven user behavior models is indispensable to enable personalized user experiences. Recently, representation learning has emerged as an effective strategy for user modeling, powered by neural networks trained over large volumes of interaction data. Despite their enormous potential, we encounter the unique challenge of data sparsity for a vast majority of entities, e.g., sparsity in ground-truth labels for entities and in entity-level interactions (cold-start users, items in the long-tail, and ephemeral groups). In this dissertation, we develop generalizable neural representation learning frameworks for user behavior modeling designed to address different sparsity challenges across applications. Our problem settings span transductive and inductive learning scenarios, where transductive learning models entities seen during training and inductive learning targets entities that are only observed during inference. We leverage different facets of information reflecting user behavior (e.g., interconnectivity in social networks, temporal and attributed interaction information) to enable personalized inference at scale. Our proposed models are complementary to concurrent advances in neural architectural choices and are adaptive to the rapid addition of new applications in online platforms.
    GRAPHITE: Generating Automatic Physical Examples for Machine-Learning Attacks on Computer Vision Systems. (arXiv:2002.07088v6 [cs.CR] UPDATED)
    This paper investigates an adversary's ease of attack in generating adversarial examples for real-world scenarios. We address three key requirements for practical attacks for the real-world: 1) automatically constraining the size and shape of the attack so it can be applied with stickers, 2) transform-robustness, i.e., robustness of a attack to environmental physical variations such as viewpoint and lighting changes, and 3) supporting attacks in not only white-box, but also black-box hard-label scenarios, so that the adversary can attack proprietary models. In this work, we propose GRAPHITE, an efficient and general framework for generating attacks that satisfy the above three key requirements. GRAPHITE takes advantage of transform-robustness, a metric based on expectation over transforms (EoT), to automatically generate small masks and optimize with gradient-free optimization. GRAPHITE is also flexible as it can easily trade-off transform-robustness, perturbation size, and query count in black-box settings. On a GTSRB model in a hard-label black-box setting, we are able to find attacks on all possible 1,806 victim-target class pairs with averages of 77.8% transform-robustness, perturbation size of 16.63% of the victim images, and 126K queries per pair. For digital-only attacks where achieving transform-robustness is not a requirement, GRAPHITE is able to find successful small-patch attacks with an average of only 566 queries for 92.2% of victim-target pairs. GRAPHITE is also able to find successful attacks using perturbations that modify small areas of the input image against PatchGuard, a recently proposed defense against patch-based attacks.  ( 3 min )
    Toward Robust Autotuning of Noisy Quantum Dot Devices. (arXiv:2108.00043v2 [quant-ph] UPDATED)
    The current autotuning approaches for quantum dot (QD) devices, while showing some success, lack an assessment of data reliability. This leads to unexpected failures when noisy or otherwise low-quality data is processed by an autonomous system. In this work, we propose a framework for robust autotuning of QD devices that combines a machine learning (ML) state classifier with a data quality control module. The data quality control module acts as a "gatekeeper" system, ensuring that only reliable data are processed by the state classifier. Lower data quality results in either device recalibration or termination. To train both ML systems, we enhance the QD simulation by incorporating synthetic noise typical of QD experiments. We confirm that the inclusion of synthetic noise in the training of the state classifier significantly improves the performance, resulting in an accuracy of 95.0(9) % when tested on experimental data. We then validate the functionality of the data quality control module by showing that the state classifier performance deteriorates with decreasing data quality, as expected. Our results establish a robust and flexible ML framework for autonomous tuning of noisy QD devices.  ( 2 min )
    Risk-Aware Scene Sampling for Dynamic Assurance of Autonomous Systems. (arXiv:2202.13510v1 [cs.RO])
    Autonomous Cyber-Physical Systems must often operate under uncertainties like sensor degradation and shifts in the operating conditions, which increases its operational risk. Dynamic Assurance of these systems requires designing runtime safety components like Out-of-Distribution detectors and risk estimators, which require labeled data from different operating modes of the system that belong to scenes with adverse operating conditions, sensors, and actuator faults. Collecting real-world data of these scenes can be expensive and sometimes not feasible. So, scenario description languages with samplers like random and grid search are available to generate synthetic data from simulators, replicating these real-world scenes. However, we point out three limitations in using these conventional samplers. First, they are passive samplers, which do not use the feedback of previous results in the sampling process. Second, the variables to be sampled may have constraints that are often not included. Third, they do not balance the tradeoff between exploration and exploitation, which we hypothesize is necessary for better search space coverage. We present a scene generation approach with two samplers called Random Neighborhood Search (RNS) and Guided Bayesian Optimization (GBO), which extend the conventional random search and Bayesian Optimization search to include the limitations. Also, to facilitate the samplers, we use a risk-based metric that evaluates how risky the scene was for the system. We demonstrate our approach using an Autonomous Vehicle example in CARLA simulation. To evaluate our samplers, we compared them against the baselines of random search, grid search, and Halton sequence search. Our samplers of RNS and GBO sampled a higher percentage of high-risk scenes of 83% and 92%, compared to 56%, 66% and 71% of the grid, random and Halton samplers, respectively.  ( 2 min )
    Surrogate Model For Field Optimization Using Beta-VAE Based Regression. (arXiv:2008.11433v2 [cs.LG] UPDATED)
    Oilfield development related decisions are made using reservoir simulation-based optimization study in which different production scenarios and well controls are compared. Such simulations are computationally expensive and so surrogate models are used to accelerate studies. Deep learning has been used in past to generate surrogates, but such models often fail to quantify prediction uncertainty and are not interpretable. In this work, beta-VAE based regression is proposed to generate simulation surrogates for use in optimization workflow. beta-VAE enables interpretable, factorized representation of decision variables in latent space, which is then further used for regression. Probabilistic dense layers are used to quantify prediction uncertainty and enable approximate Bayesian inference. Surrogate model developed using beta-VAE based regression finds interpretable and relevant latent representation. A reasonable value of beta ensures a good balance between factor disentanglement and reconstruction. Probabilistic dense layer helps in quantifying predicted uncertainty for objective function, which is then used to decide whether full-physics simulation is required for a case.  ( 2 min )
    Neural-Progressive Hedging: Enforcing Constraints in Reinforcement Learning with Stochastic Programming. (arXiv:2202.13436v1 [cs.LG])
    We propose a framework, called neural-progressive hedging (NP), that leverages stochastic programming during the online phase of executing a reinforcement learning (RL) policy. The goal is to ensure feasibility with respect to constraints and risk-based objectives such as conditional value-at-risk (CVaR) during the execution of the policy, using probabilistic models of the state transitions to guide policy adjustments. The framework is particularly amenable to the class of sequential resource allocation problems since feasibility with respect to typical resource constraints cannot be enforced in a scalable manner. The NP framework provides an alternative that adds modest overhead during the online phase. Experimental results demonstrate the efficacy of the NP framework on two continuous real-world tasks: (i) the portfolio optimization problem with liquidity constraints for financial planning, characterized by non-stationary state distributions; and (ii) the dynamic repositioning problem in bike sharing systems, that embodies the class of supply-demand matching problems. We show that the NP framework produces policies that are better than deep RL and other baseline approaches, adapting to non-stationarity, whilst satisfying structural constraints and accommodating risk measures in the resulting policies. Additional benefits of the NP framework are ease of implementation and better explainability of the policies.
    Parametric Flatten-T Swish: An Adaptive Non-linear Activation Function For Deep Learning. (arXiv:2011.03155v2 [cs.LG] UPDATED)
    Activation function is a key component in deep learning that performs non-linear mappings between the inputs and outputs. Rectified Linear Unit (ReLU) has been the most popular activation function across the deep learning community. However, ReLU contains several shortcomings that can result in inefficient training of the deep neural networks, these are: 1) the negative cancellation property of ReLU tends to treat negative inputs as unimportant information for the learning, resulting in a performance degradation; 2) the inherent predefined nature of ReLU is unlikely to promote additional flexibility, expressivity, and robustness to the networks; 3) the mean activation of ReLU is highly positive and leads to bias shift effect in network layers; and 4) the multilinear structure of ReLU restricts the non-linear approximation power of the networks. To tackle these shortcomings, this paper introduced Parametric Flatten-T Swish (PFTS) as an alternative to ReLU. By taking ReLU as a baseline method, the experiments showed that PFTS improved classification accuracy on SVHN dataset by 0.31%, 0.98%, 2.16%, 17.72%, 1.35%, 0.97%, 39.99%, and 71.83% on DNN-3A, DNN-3B, DNN-4, DNN- 5A, DNN-5B, DNN-5C, DNN-6, and DNN-7, respectively. Besides, PFTS also achieved the highest mean rank among the comparison methods. The proposed PFTS manifested higher non-linear approximation power during training and thereby improved the predictive performance of the networks.  ( 2 min )
    Robust Classification via Support Vector Machines. (arXiv:2104.13458v2 [stat.ML] UPDATED)
    Classification models are very sensitive to data uncertainty, and finding robust classifiers that are less sensitive to data uncertainty has raised great interest in the machine learning literature. This paper aims to construct robust \emph{Support Vector Machine} classifiers under feature data uncertainty via two probabilistic arguments. The first classifier, \emph{Single Perturbation}, reduces the local effect of data uncertainty with respect to one given feature and acts as a local test that could confirm or refute the presence of significant data uncertainty for that particular feature. The second classifier, \emph{Extreme Empirical Loss}, aims to reduce the aggregate effect of data uncertainty with respect to all features, which is possible via a trade-off between the number of prediction model violations and the size of these violations. Both methodologies are computationally efficient and our extensive numerical investigation highlights the advantages and possible limitations of the two robust classifiers on synthetic and real-life data.  ( 2 min )
    Stability vs Implicit Bias of Gradient Methods on Separable Data and Beyond. (arXiv:2202.13441v1 [cs.LG])
    An influential line of recent work has focused on the generalization properties of unregularized gradient-based learning procedures applied to separable linear classification with exponentially-tailed loss functions. The ability of such methods to generalize well has been attributed to the their implicit bias towards large margin predictors, both asymptotically as well as in finite time. We give an additional unified explanation for this generalization and relate it to two simple properties of the optimization objective, that we refer to as realizability and self-boundedness. We introduce a general setting of unconstrained stochastic convex optimization with these properties, and analyze generalization of gradient methods through the lens of algorithmic stability. In this broader setting, we obtain sharp stability bounds for gradient descent and stochastic gradient descent which apply even for a very large number of gradient steps, and use them to derive general generalization bounds for these algorithms. Finally, as direct applications of the general bounds, we return to the setting of linear classification with separable data and establish several novel test loss and test accuracy bounds for gradient descent and stochastic gradient descent for a variety of loss functions with different tail decay rates. In some of these case, our bounds significantly improve upon the existing generalization error bounds in the literature.  ( 2 min )
    High Dimensional Statistical Estimation under One-bit Quantization. (arXiv:2202.13157v1 [stat.ML])
    Compared with data with high precision, one-bit (binary) data are preferable in many applications because of the efficiency in signal storage, processing, transmission, and enhancement of privacy. In this paper, we study three fundamental statistical estimation problems, i.e., sparse covariance matrix estimation, sparse linear regression, and low-rank matrix completion via binary data arising from an easy-to-implement one-bit quantization process that contains truncation, dithering and quantization as typical steps. Under both sub-Gaussian and heavy-tailed regimes, new estimators that handle high-dimensional scaling are proposed. In sub-Gaussian case, we show that our estimators achieve minimax rates up to logarithmic factors, hence the quantization nearly costs nothing from the perspective of statistical learning rate. In heavy-tailed case, we truncate the data before dithering to achieve a bias-variance trade-off, which results in estimators embracing convergence rates that are the square root of the corresponding minimax rates. Experimental results on synthetic data are reported to support and demonstrate the statistical properties of our estimators under one-bit quantization.  ( 2 min )
    Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms. (arXiv:2202.13001v1 [cs.LG])
    We study a sequential decision problem where the learner faces a sequence of $K$-armed stochastic bandit tasks. The tasks may be designed by an adversary, but the adversary is constrained to choose the optimal arm of each task in a smaller (but unknown) subset of $M$ arms. The task boundaries might be known (the bandit meta-learning setting), or unknown (the non-stationary bandit setting), and the number of tasks $N$ as well as the total number of rounds $T$ are known ($N$ could be unknown in the meta-learning setting). We design an algorithm based on a reduction to bandit submodular maximization, and show that its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(N\sqrt{M \tau}+N^{2/3})$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M \tau}+N^{1/2})$ regret.  ( 2 min )
    Split HE: Fast Secure Inference Combining Split Learning and Homomorphic Encryption. (arXiv:2202.13351v1 [cs.CR])
    This work presents a novel protocol for fast secure inference of neural networks applied to computer vision applications. It focuses on improving the overall performance of the online execution by deploying a subset of the model weights in plaintext on the client's machine, in the fashion of SplitNNs. We evaluate our protocol on benchmark neural networks trained on the CIFAR-10 dataset using SEAL via TenSEAL and discuss runtime and security performances. Empirical security evaluation using Membership Inference and Model Extraction attacks showed that the protocol was more resilient under the same attacks than a similar approach also based on SplitNN. When compared to related work, we demonstrate improvements of 2.5x-10x for the inference time and 14x-290x in communication costs.
    Outlier-Robust Optimal Transport: Duality, Structure, and Statistical Analysis. (arXiv:2111.01361v3 [stat.ML] UPDATED)
    The Wasserstein distance, rooted in optimal transport (OT) theory, is a popular discrepancy measure between probability distributions with various applications to statistics and machine learning. Despite their rich structure and demonstrated utility, Wasserstein distances are sensitive to outliers in the considered distributions, which hinders applicability in practice. We propose a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from each contaminated distribution. Under standard moment assumptions, $\mathsf{W}_p^\varepsilon$ is shown to be minimax optimal for robust estimation under the Huber $\varepsilon$-contamination model. Our formulation of this robust distance amounts to a highly regular optimization problem that lends itself better for analysis compared to previously considered frameworks. Leveraging this, we conduct a thorough theoretical study of $\mathsf{W}_p^\varepsilon$, encompassing robustness guarantees, characterization of optimal perturbations, regularity, duality, and statistical estimation. In particular, by decoupling the optimization variables, we arrive at a simple dual form for $\mathsf{W}_p^\varepsilon$ that can be implemented via an elementary modification to standard, duality-based OT solvers. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.  ( 2 min )
    Robust Continual Learning through a Comprehensively Progressive Bayesian Neural Network. (arXiv:2202.13369v1 [cs.LG])
    This work proposes a comprehensively progressive Bayesian neural network for robust continual learning of a sequence of tasks. A Bayesian neural network is progressively pruned and grown such that there are sufficient network resources to represent a sequence of tasks, while the network does not explode. It starts with the contention that similar tasks should have the same number of total network resources, to ensure fair representation of all tasks in a continual learning scenario. Thus, as the data for new task streams in, sufficient neurons are added to the network such that the total number of neurons in each layer of the network, including the shared representations with previous tasks and individual task related representation, are equal for all tasks. The weights that are redundant at the end of training each task are also pruned through re-initialization, in order to be efficiently utilized in the subsequent task. Thus, the network grows progressively, but ensures effective utilization of network resources. We refer to our proposed method as 'Robust Continual Learning through a Comprehensively Progressive Bayesian Neural Network (RCL-CPB)' and evaluate the proposed approach on the MNIST data set, under three different continual learning scenarios. Further to this, we evaluate the performance of RCL-CPB on a homogeneous sequence of tasks using split CIFAR100 (20 tasks of 5 classes each), and a heterogeneous sequence of tasks using MNIST, SVHN and CIFAR10 data sets. The demonstrations and the performance results show that the proposed strategies for progressive BNN enable robust continual learning.  ( 2 min )
    Regularized Bilinear Discriminant Analysis for Multivariate Time Series Data. (arXiv:2202.13188v1 [cs.LG])
    In recent years, the methods on matrix-based or bilinear discriminant analysis (BLDA) have received much attention. Despite their advantages, it has been reported that the traditional vector-based regularized LDA (RLDA) is still quite competitive and could outperform BLDA on some benchmark datasets. Nevertheless, it is also noted that this finding is mainly limited to image data. In this paper, we propose regularized BLDA (RBLDA) and further explore the comparison between RLDA and RBLDA on another type of matrix data, namely multivariate time series (MTS). Unlike image data, MTS typically consists of multiple variables measured at different time points. Although many methods for MTS data classification exist within the literature, there is relatively little work in exploring the matrix data structure of MTS data. Moreover, the existing BLDA can not be performed when one of its within-class matrices is singular. To address the two problems, we propose RBLDA for MTS data classification, where each of the two within-class matrices is regularized via one parameter. We develop an efficient implementation of RBLDA and an efficient model selection algorithm with which the cross validation procedure for RBLDA can be performed efficiently. Experiments on a number of real MTS data sets are conducted to evaluate the proposed algorithm and compare RBLDA with several closely related methods, including RLDA and BLDA. The results reveal that RBLDA achieves the best overall recognition performance and the proposed model selection algorithm is efficient; Moreover, RBLDA can produce better visualization of MTS data than RLDA.
    Dual-Branched Spatio-temporal Fusion Network for Multi-horizon Tropical Cyclone Track Forecast. (arXiv:2202.13336v1 [cs.LG])
    Tropical cyclone (TC) is an extreme tropical weather system and its trajectory can be described by a variety of spatio-temporal data. Effective mining of these data is the key to accurate TCs track forecasting. However, existing methods face the problem that the model complexity is too high or it is difficult to efficiently extract features from multi-modal data. In this paper, we propose the Dual-Branched spatio-temporal Fusion Network (DBF-Net) -- a novel multi-horizon tropical cyclone track forecasting model which fuses the multi-modal features efficiently. DBF-Net contains a TC features branch that extracts temporal features from 1D inherent features of TCs and a pressure field branch that extracts spatio-temporal features from reanalysis 2D pressure field. Through the encoder-decoder-based architecture and efficient feature fusion, DBF-Net can fully mine the information of the two types of data, and achieve good TCs track prediction results. Extensive experiments on historical TCs track data in the Northwest Pacific show that our DBF-Net achieves significant improvement compared with existing statistical and deep learning TCs track forecast methods.
    Analysis of Visual Reasoning on One-Stage Object Detection. (arXiv:2202.13115v1 [cs.CV])
    Current state-of-the-art one-stage object detectors are limited by treating each image region separately without considering possible relations of the objects. This causes dependency solely on high-quality convolutional feature representations for detecting objects successfully. However, this may not be possible sometimes due to some challenging conditions. In this paper, the usage of reasoning features on one-stage object detection is analyzed. We attempted different architectures that reason the relations of the image regions by using self-attention. YOLOv3-Reasoner2 model spatially and semantically enhances features in the reasoning layer and fuses them with the original convolutional features to improve performance. The YOLOv3-Reasoner2 model achieves around 2.5% absolute improvement with respect to baseline YOLOv3 on COCO in terms of mAP while still running in real-time.
    Continuous Human Action Recognition for Human-Machine Interaction: A Review. (arXiv:2202.13096v1 [cs.CV])
    With advances in data-driven machine learning research, a wide variety of prediction models have been proposed to capture spatio-temporal features for the analysis of video streams. Recognising actions and detecting action transitions within an input video are challenging but necessary tasks for applications that require real-time human-machine interaction. By reviewing a large body of recent related work in the literature, we thoroughly analyse, explain and compare action segmentation methods and provide details on the feature extraction and learning strategies that are used on most state-of-the-art methods. We cover the impact of the performance of object detection and tracking techniques on human action segmentation methodologies. We investigate the application of such models to real-world scenarios and discuss several limitations and key research directions towards improving interpretability, generalisation, optimisation and deployment.  ( 2 min )
    Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition. (arXiv:2202.10290v2 [eess.AS] UPDATED)
    Despite the rapid progress of automatic speech recognition (ASR) technologies targeting normal speech in recent decades, accurate recognition of dysarthric and elderly speech remains highly challenging tasks to date. Sources of heterogeneity commonly found in normal speech including accent or gender, when further compounded with the variability over age and speech pathology severity level, create large diversity among speakers. To this end, speaker adaptation techniques play a key role in personalization of ASR systems for such users. Motivated by the spectro-temporal level differences between dysarthric, elderly and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectrotemporal subspace basis deep embedding features derived using SVD speech spectrum decomposition are proposed in this paper to facilitate auxiliary feature based speaker adaptation of state-of-the-art hybrid DNN/TDNN and end-to-end Conformer speech recognition systems. Experiments were conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The proposed spectro-temporal deep feature adapted systems outperformed baseline i-Vector and xVector adaptation by up to 2.63% absolute (8.63% relative) reduction in word error rate (WER). Consistent performance improvements were retained after model based speaker adaptation using learning hidden unit contributions (LHUC) was further applied. The best speaker adapted system using the proposed spectral basis embedding features produced the lowest published WER of 25.05% on the UASpeech test set of 16 dysarthric speakers.  ( 2 min )
    Deep Spatially and Temporally Aware Similarity Computation for Road Network Constrained Trajectories. (arXiv:2112.09339v2 [cs.LG] UPDATED)
    Trajectory similarity computation has drawn massive attention, as it is core functionality in a wide range of applications such as ride-sharing, traffic analysis, and social recommendation. Motivated by the recent success of deep learning technologies, researchers start devoting efforts to learning-based similarity analyses to overcome the limitations (i.e., high cost and poor adaptability) of traditional methods. Specifically, deep trajectory similarity computation aims to learn a distance function that can evaluate how similar two trajectories are via neural networks. However, existing learning-based methods focus on spatial similarity but ignore the time dimension of trajectories, which is suboptimal for time-aware applications. Besides, they tend to disregard the embedding of trajectories into road networks, restricting their applicability in real scenarios. In this paper, we propose an effective learning-based framework, called ST2Vec, to perform efficient spatially and temporally aware trajectory similarity computation in road networks. Finally, extensive experimental evaluation using three real trajectory data sets shows that ST2Vec outperforms all the state-of-the-art approaches substantially.  ( 2 min )
    Modulation and signal class labelling using active learning and classification using machine learning. (arXiv:2202.12930v1 [eess.SP])
    Supervised learning in machine learning (ML) requires labelled data set. Further real-time data classification requires an easily available methodology for labelling. Wireless modulation and signal classification find their application in plenty of areas such as military, commercial and electronic reconaissance and cognitive radio. This paper mainly aims to solve the problem of real-time wireless modulation and signal class labelling with an active learning framework. Further modulation and signal classification is performed with machine learning algorithms such as KNN, SVM, Naive bayes. Active learning helps in labelling the data points belonging to different classes with the least amount of data samples trained. An accuracy of 86 percent is obtained by the active learning algorithm for the signal with SNR 18 dB. Further, KNN based model for modulation and signal classification performs well over range of SNR, and an accuracy of 99.8 percent is obtained for 18 dB signal. The novelty of this work exists in applying active learning for wireless modulation and signal class labelling. Both modulation and signal classes are labelled at a given time with help of couplet formation from the data samples.  ( 2 min )
    Simplified and Unified Analysis of Various Learning Problems by Reduction to Multiple-Instance Learning. (arXiv:1911.05999v3 [cs.LG] UPDATED)
    In statistical learning, many problem formulations have been proposed so far, such as multi-class learning, complementarily labeled learning, multi-label learning, multi-task learning, which provide theoretical models for various real-world tasks. Although they have been extensively studied, the relationship among them has not been fully investigated. In this work, we focus on a particular problem formulation called Multiple-Instance Learning (MIL), and show that various learning problems including all the problems mentioned above with some of new problems can be reduced to MIL with theoretically guaranteed generalization bounds, where the reductions are established under a new reduction scheme we provide as a by-product. The results imply that the MIL-reduction gives a simplified and unified framework for designing and analyzing algorithms for various learning problems. Moreover, we show that the MIL-reduction framework can be kernelized.  ( 2 min )
    On the Power and Limitations of Random Features for Understanding Neural Networks. (arXiv:1904.00687v4 [cs.LG] UPDATED)
    Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as if those components are essentially fixed at their initial random values. In fact, fixing these explicitly leads to the well-known approach of learning with random features. In other words, these techniques imply that we can successfully learn with neural networks, whenever we can successfully learn with random features. In this paper, we first review these techniques, providing a simple and self-contained analysis for one-hidden-layer networks. We then argue that despite the impressive positive results, random feature approaches are also inherently limited in what they can explain. In particular, we rigorously show that random features cannot be used to learn even a single ReLU neuron with standard Gaussian inputs, unless the network size (or magnitude of the weights) is exponentially large. Since a single neuron is learnable with gradient-based methods, we conclude that we are still far from a satisfying general explanation for the empirical success of neural networks.  ( 2 min )
    Uncertainty quantification of a three-dimensional in-stent restenosis model with surrogate modelling. (arXiv:2111.06173v2 [cs.CE] UPDATED)
    In-Stent Restenosis is a recurrence of coronary artery narrowing due to vascular injury caused by balloon dilation and stent placement. It may lead to the relapse of angina symptoms or to an acute coronary syndrome. An uncertainty quantification of a model for In-Stent Restenosis with four uncertain parameters (endothelium regeneration time, the threshold strain for smooth muscle cells bond breaking, blood flow velocity and the percentage of fenestration in the internal elastic lamina) is presented. Two quantities of interest were studied, namely the average cross-sectional area and the maximum relative area loss in a vessel. Due to the computational intensity of the model and the number of evaluations required in the uncertainty quantification, a surrogate model, based on Gaussian process regression with proper orthogonal decomposition, was developed which subsequently replaced the original In-Stent Restenosis model in the uncertainty quantification. A detailed analysis of the uncertainty propagation and sensitivity analysis is presented. Around 11% and 16% of uncertainty are observed on the average cross-sectional area and maximum relative area loss respectively, and the uncertainty estimates show that a higher fenestration mainly determines uncertainty in the neointimal growth at the initial stage of the process. On the other hand, the uncertainty in blood flow velocity and endothelium regeneration time mainly determine the uncertainty in the quantities of interest at the later, clinically relevant stages of the restenosis process. The uncertainty in the threshold strain is relatively small compared to the other uncertain parameters.  ( 2 min )
    Reformulation of the No-Free-Lunch Theorem for Entangled Data Sets. (arXiv:2007.04900v2 [quant-ph] UPDATED)
    The no-free-lunch (NFL) theorem is a celebrated result in learning theory that limits one's ability to learn a function with a training data set. With the recent rise of quantum machine learning, it is natural to ask whether there is a quantum analog of the NFL theorem, which would restrict a quantum computer's ability to learn a unitary process (the quantum analog of a function) with quantum training data. However, in the quantum setting, the training data can possess entanglement, a strong correlation with no classical analog. In this work, we show that entangled data sets lead to an apparent violation of the (classical) NFL theorem. This motivates a reformulation that accounts for the degree of entanglement in the training set. As our main result, we prove a quantum NFL theorem whereby the fundamental limit on the learnability of a unitary is reduced by entanglement. We employ Rigetti's quantum computer to test both the classical and quantum NFL theorems. Our work establishes that entanglement is a commodity in quantum machine learning.  ( 2 min )
    Generalised Gaussian Process Latent Variable Models (GPLVM) with Stochastic Variational Inference. (arXiv:2202.12979v1 [cs.LG])
    Gaussian process latent variable models (GPLVM) are a flexible and non-linear approach to dimensionality reduction, extending classical Gaussian processes to an unsupervised learning context. The Bayesian incarnation of the GPLVM Titsias and Lawrence, 2010] uses a variational framework, where the posterior over latent variables is approximated by a well-behaved variational family, a factorized Gaussian yielding a tractable lower bound. However, the non-factories ability of the lower bound prevents truly scalable inference. In this work, we study the doubly stochastic formulation of the Bayesian GPLVM model amenable with minibatch training. We show how this framework is compatible with different latent variable formulations and perform experiments to compare a suite of models. Further, we demonstrate how we can train in the presence of massively missing data and obtain high-fidelity reconstructions. We demonstrate the model's performance by benchmarking against the canonical sparse GPLVM for high-dimensional data examples.  ( 2 min )
    Model-free Reinforcement Learning for Content Caching at the Wireless Edge via Restless Bandits. (arXiv:2202.13187v1 [cs.NI])
    An explosive growth in the number of on-demand content requests has imposed significant pressure on current wireless network infrastructure. To enhance the perceived user experience, and support latency-sensitive applications, edge computing has emerged as a promising computing paradigm. The performance of a wireless edge depends on contents that are cached. In this paper, we consider the problem of content caching at the wireless edge with unreliable channels to minimize average content request latency. We formulate this problem as a restless bandit problem, which is provably hard to solve. We begin by investigating a discounted counterpart, and prove that it admits an optimal policy of the threshold-type. We then show that the result also holds for the average latency problem. Using these structural results, we establish the indexability of the problem, and employ Whittle index policy to minimize average latency. Since system parameters such as content request rate are often unknown, we further develop a model-free reinforcement learning algorithm dubbed Q-Whittle learning that relies on our index policy. We also derive a bound on its finite-time convergence rate. Simulation results using real traces demonstrate that our proposed algorithms yield excellent empirical performance.  ( 2 min )
    CC-Cert: A Probabilistic Approach to Certify General Robustness of Neural Networks. (arXiv:2109.10696v2 [cs.LG] UPDATED)
    In safety-critical machine learning applications, it is crucial to defend models against adversarial attacks -- small modifications of the input that change the predictions. Besides rigorously studied $\ell_p$-bounded additive perturbations, recently proposed semantic perturbations (e.g. rotation, translation) raise a serious concern on deploying ML systems in real-world. Therefore, it is important to provide provable guarantees for deep learning models against semantically meaningful input transformations. In this paper, we propose a new universal probabilistic certification approach based on Chernoff-Cramer bounds that can be used in general attack settings. We estimate the probability of a model to fail if the attack is sampled from a certain distribution. Our theoretical findings are supported by experimental results on different datasets.  ( 2 min )
    2021 BEETL Competition: Advancing Transfer Learning for Subject Independence & Heterogenous EEG Data Sets. (arXiv:2202.12950v1 [eess.SP])
    Transfer learning and meta-learning offer some of the most promising avenues to unlock the scalability of healthcare and consumer technologies driven by biosignal data. This is because current methods cannot generalise well across human subjects' data and handle learning from different heterogeneously collected data sets, thus limiting the scale of training data. On the other side, developments in transfer learning would benefit significantly from a real-world benchmark with immediate practical application. Therefore, we pick electroencephalography (EEG) as an exemplar for what makes biosignal machine learning hard. We design two transfer learning challenges around diagnostics and Brain-Computer-Interfacing (BCI), that have to be solved in the face of low signal-to-noise ratios, major variability among subjects, differences in the data recording sessions and techniques, and even between the specific BCI tasks recorded in the dataset. Task 1 is centred on the field of medical diagnostics, addressing automatic sleep stage annotation across subjects. Task 2 is centred on Brain-Computer Interfacing (BCI), addressing motor imagery decoding across both subjects and data sets. The BEETL competition with its over 30 competing teams and its 3 winning entries brought attention to the potential of deep transfer learning and combinations of set theory and conventional machine learning techniques to overcome the challenges. The results set a new state-of-the-art for the real-world BEETL benchmark.  ( 2 min )
    Automating Data Science: Prospects and Challenges. (arXiv:2105.05699v2 [cs.DB] UPDATED)
    Given the complexity of typical data science projects and the associated demand for human expertise, automation has the potential to transform the data science process. Key insights: * Automation in data science aims to facilitate and transform the work of data scientists, not to replace them. * Important parts of data science are already being automated, especially in the modeling stages, where techniques such as automated machine learning (AutoML) are gaining traction. * Other aspects are harder to automate, not only because of technological challenges, but because open-ended and context-dependent tasks require human interaction.  ( 2 min )
    BagPipe: Accelerating Deep Recommendation Model Training. (arXiv:2202.12429v1 [cs.DC] CROSS LISTED)
    Deep learning based recommendation models (DLRM) are widely used in several business critical applications. Training such recommendation models efficiently is challenging primarily because they consist of billions of embedding-based parameters which are often stored remotely leading to significant overheads from embedding access. By profiling existing DLRM training, we observe that only 8.5% of the iteration time is spent in forward/backward pass while the remaining time is spent on embedding and model synchronization. Our key insight in this paper is that access to embeddings have a specific structure and pattern which can be used to accelerate training. We observe that embedding accesses are heavily skewed, with almost 1% of embeddings represent more than 92% of total accesses. Further, we observe that during training we can lookahead at future batches to determine exactly which embeddings will be needed at what iteration in the future. Based on these insight, we propose Bagpipe, a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with the computation. We designed an Oracle Cacher, a new system component which uses our lookahead algorithm to generate optimal cache update decisions and provide strong consistency guarantees. Our experiments using three datasets and two models shows that our approach provides a speed up of up to 6.2x compared to state of the art baselines, while providing the same convergence and reproducibility guarantees as synchronous training.  ( 2 min )
    PRIMA: General and Precise Neural Network Certification via Scalable Convex Hull Approximations. (arXiv:2103.03638v3 [cs.AI] UPDATED)
    Formal verification of neural networks is critical for their safe adoption in real-world applications. However, designing a precise and scalable verifier which can handle different activation functions, realistic network architectures and relevant specifications remains an open and difficult challenge. In this paper, we take a major step forward in addressing this challenge and present a new verification framework, called PRIMA. PRIMA is both (i) general: it handles any non-linear activation function, and (ii) precise: it computes precise convex abstractions involving multiple neurons via novel convex hull approximation algorithms that leverage concepts from computational geometry. The algorithms have polynomial complexity, yield fewer constraints, and minimize precision loss. We evaluate the effectiveness of PRIMA on a variety of challenging tasks from prior work. Our results show that PRIMA is significantly more precise than the state-of-the-art, verifying robustness to input perturbations for up to 20%, 30%, and 34% more images than existing work on ReLU-, Sigmoid-, and Tanh-based networks, respectively. Further, PRIMA enables, for the first time, the precise verification of a realistic neural network for autonomous driving within a few minutes.  ( 2 min )
    Federated Distillation of Natural Language Understanding with Confident Sinkhorns. (arXiv:2110.02432v1 [cs.CL] CROSS LISTED)
    Enhancing the user experience is an essential task for application service providers. For instance, two users living wide apart may have different tastes of food. A food recommender mobile application installed on an edge device might want to learn from user feedback (reviews) to satisfy the client's needs pertaining to distinct domains. Retrieving user data comes at the cost of privacy while asking for model parameters trained on a user device becomes space inefficient at a large scale. In this work, we propose an approach to learn a central (global) model from the federation of (local) models which are trained on user-devices, without disclosing the local data or model parameters to the server. We propose a federation mechanism for the problems with natural similarity metric between the labels which commonly appear in natural language understanding (NLU) tasks. To learn the global model, the objective is to minimize the optimal transport cost of the global model's predictions from the confident sum of soft-targets assigned by local models. The confidence (a model weighting scheme) score of a model is defined as the L2 distance of a model's prediction from its probability bias. The method improves the global model's performance over the baseline designed on three NLU tasks with intrinsic label space semantics, i.e., fine-grained sentiment analysis, emotion recognition in conversation, and natural language inference. We make our codes public at https://github.com/declare-lab/sinkhorn-loss.  ( 2 min )
    Joint Learning of Linear Time-Invariant Dynamical Systems. (arXiv:2112.10955v3 [stat.ML] UPDATED)
    Learning the parameters of multiple related linear time-invariant (LTI) systems is a problem of interest that remains unexplored to date. We develop a joint estimator for learning the transition matrices of multiple LTI systems that share common basis matrices. Further, we establish finite-time error bounds that reflect the influence of trajectory lengths, dimension, number of systems, and the transition matrices. The results hold under fairly general assumptions and showcase the gains from pooling data across systems, in comparison to learning individual systems separately. We also show robustness to misspecifications in the joint structure of the LTI systems. To derive the results, novel methods are developed for establishing spectral properties of transition matrices and of dependent random matrices, as well as effects of unknown deviation matrices on joint learning.  ( 2 min )
    Nuances in Margin Conditions Determine Gains in Active Learning. (arXiv:2110.08418v2 [stat.ML] UPDATED)
    We consider nonparametric classification with smooth regression functions, where it is well known that notions of margin in $E[Y|X]$ determine fast or slow rates in both active and passive learning. Here we elucidate a striking distinction between the two settings. Namely, we show that some seemingly benign nuances in notions of margin -- involving the uniqueness of the Bayes classifier, and which have no apparent effect on rates in passive learning -- determine whether or not any active learner can outperform passive learning rates. In particular, for Audibert-Tsybakov's margin condition (allowing general situations with non-unique Bayes classifiers), no active learner can gain over passive learning in commonly studied settings where the marginal on $X$ is near uniform. Our results thus negate the usual intuition from past literature that active rates should improve over passive rates in nonparametric settings.  ( 2 min )
    The Price of Tailoring the Index to Your Data: Poisoning Attacks on Learned Index Structures. (arXiv:2008.00297v2 [cs.CR] UPDATED)
    The concept of learned index structures relies on the idea that the input-output functionality of a database index can be viewed as a prediction task and, thus, be implemented using a machine learning model instead of traditional algorithmic techniques. This novel angle for a decades-old problem has inspired numerous exciting results in the intersection of machine learning and data structures. However, the main advantage of learned index structures, i.e., the ability to adjust to the data at hand via the underlying ML-model, can become a disadvantage from a security perspective as it could be exploited. In this work, we present the first study of poisoning attacks on learned index structures. The required poisoning approach is different from all previous works since the model under attack is trained on a cumulative distribution function (CDF) and, thus, every injection on the training set has a cascading impact on multiple data values. We formulate the first poisoning attacks on linear regression models trained on the CDF, which is a basic building block of the proposed learned index structures. We generalize our poisoning techniques to attack a more advanced two-stage design of learned index structures called recursive model index (RMI), which has been shown to outperform traditional B-Trees. We evaluate our attacks on real-world and synthetic datasets under a wide variety of parameterizations of the model and show that the error of the RMI increases up to $300\times$ and the error of its second-stage models increases up to $3000\times$.  ( 2 min )
    Rethinking Multidimensional Discriminator Output for Generative Adversarial Networks. (arXiv:2109.03378v2 [stat.ML] UPDATED)
    The study of multidimensional discriminator (critic) output for Generative Adversarial Networks has been underexplored in the literature. In this paper, we generalize the Wasserstein GAN framework to take advantage of multidimensional critic output and explore its properties. We also introduce a square-root velocity transformation (SRVT) block which favors training in the multidimensional setting. Proofs of properties are based on our proposed maximal p-centrality discrepancy, which is bounded above by p-Wasserstein distance and fits the Wasserstein GAN framework with multidimensional critic output n. Especially when n = 1 and p = 1, the proposed discrepancy equals 1-Wasserstein distance. Theoretical analysis and empirical evidence show that high-dimensional critic output has its advantage on distinguishing real and fake distributions, and benefits faster convergence and diversity of results.  ( 2 min )
    Integrating Deep Reinforcement and Supervised Learning to Expedite Indoor Mapping. (arXiv:2109.08490v2 [cs.LG] UPDATED)
    The challenge of mapping indoor environments is addressed. Typical heuristic algorithms for solving the motion planning problem are frontier-based methods, that are especially effective when the environment is completely unknown. However, in cases where prior statistical data on the environment's architectonic features is available, such algorithms can be far from optimal. Furthermore, their calculation time may increase substantially as more areas are exposed. In this paper we propose two means by which to overcome these shortcomings. One is the use of deep reinforcement learning to train the motion planner. The second is the inclusion of a pre-trained generative deep neural network, acting as a map predictor. Each one helps to improve the decision making through use of the learned structural statistics of the environment, and both, being realized as neural networks, ensure a constant calculation time. We show that combining the two methods can shorten the duration of the mapping process by up to 4 times, compared to frontier-based motion planning.  ( 2 min )
    Attention-based Clinical Note Summarization. (arXiv:2104.08942v3 [cs.CL] UPDATED)
    In recent years, the trend of deploying digital systems in numerous industries has hiked. The health sector has observed an extensive adoption of digital systems and services that generate significant medical records. Electronic health records contain valuable information for prospective and retrospective analysis that is often not entirely exploited because of the complicated dense information storage. The crude purpose of condensing health records is to select the information that holds most characteristics of the original documents based on a reported disease. These summaries may boost diagnosis and save a doctor's time during a saturated workload situation like the COVID-19 pandemic. In this paper, we are applying a multi-head attention-based mechanism to perform extractive summarization of meaningful phrases on clinical notes. Our method finds major sentences for a summary by correlating tokens, segments, and positional embeddings of sentences in a clinical note. The model outputs attention scores that are statistically transformed to extract critical phrases for visualization on the heat-mapping tool and for human use.  ( 2 min )
    Lower Bounds for Non-Convex Stochastic Optimization. (arXiv:1912.02365v2 [math.OC] UPDATED)
    We lower bound the complexity of finding $\epsilon$-stationary points (with gradient norm at most $\epsilon$) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least $\epsilon^{-4}$ queries to find an $\epsilon$ stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of $\epsilon^{-3}$ queries, establishing the optimality of recently proposed variance reduction techniques.  ( 2 min )
    Double Thompson Sampling in Finite stochastic Games. (arXiv:2202.10008v2 [cs.LG] UPDATED)
    We consider the trade-off problem between exploration and exploitation under finite discounted Markov Decision Process, where the state transition matrix of the underlying environment stays unknown. We propose a double Thompson sampling reinforcement learning algorithm(DTS) to solve this kind of problem. This algorithm achieves a total regret bound of $\tilde{\mathcal{O}}(D\sqrt{SAT})$in time horizon $T$ with $S$ states, $A$ actions and diameter $D$. DTS consists of two parts, the first part is the traditional part where we apply the posterior sampling method on transition matrix based on prior distribution. In the second part, we employ a count-based posterior update method to balance between the local optimal action and the long-term optimal action in order to find the global optimal game value. We established a regret bound of $\tilde{\mathcal{O}}(\sqrt{T}/S^{2})$. Which is by far the best regret bound for finite discounted Markov Decision Process to our knowledge. Numerical results proves the efficiency and superiority of our approach.  ( 2 min )
    Learning to Estimate Without Bias. (arXiv:2110.12403v2 [cs.LG] UPDATED)
    The Gauss Markov theorem states that the weighted least squares estimator is a linear minimum variance unbiased estimation (MVUE) in linear models. In this paper, we take a first step towards extending this result to non-linear settings via deep learning with bias constraints. The classical approach to designing non-linear MVUEs is through maximum likelihood estimation (MLE) which often involves real-time computationally challenging optimizations. On the other hand, deep learning methods allow for non-linear estimators with fixed computational complexity. Learning based estimators perform optimally on average with respect to their training set but may suffer from significant bias in other parameters. To avoid this, we propose to add a simple bias constraint to the loss function, resulting in an estimator we refer to as Bias Constrained Estimator (BCE). We prove that this yields asymptotic MVUEs that behave similarly to the classical MLEs and asymptotically attain the Cramer Rao bound. We demonstrate the advantages of our approach in the context of signal to noise ratio estimation as well as covariance estimation. A second motivation to BCE is in applications where multiple estimates of the same unknown are averaged for improved performance. Examples include distributed sensor networks and data augmentation in test-time. In such applications, we show that BCE leads to asymptotically consistent estimators.  ( 2 min )
    FOLD-RM: A Scalable and Efficient Inductive Learning Algorithm for Multi-Category Classification of Mixed Data. (arXiv:2202.06913v2 [cs.LG] UPDATED)
    FOLD-RM is an automated inductive learning algorithm for learning default rules for mixed (numerical and categorical) data. It generates an (explainable) answer set programming (ASP) rule set for multi-category classification tasks while maintaining efficiency and scalability. The FOLD-RM algorithm is competitive in performance with the widely-used XGBoost algorithm, however, unlike XGBoost, the FOLD-RM algorithm produces an explainable model. FOLD-RM outperforms XGBoost on some datasets, particularly large ones. FOLD-RM also provides human-friendly explanations for predictions.  ( 2 min )
    Sequential Deconfounding for Causal Inference with Unobserved Confounders. (arXiv:2104.09323v3 [stat.ME] UPDATED)
    Using observational data to estimate the effect of a treatment is a powerful tool for decision-making when randomized experiments are infeasible or costly. However, observational data often yields biased estimates of treatment effects, since treatment assignment can be confounded by unobserved variables. A remedy is offered by deconfounding methods that adjust for such unobserved confounders. In this paper, we develop the Sequential Deconfounder, a method that enables estimating individualized treatment effects over time in presence of unobserved confounders. This is the first deconfounding method that can be used in a general sequential setting (i.e., with one or more treatments assigned at each timestep). The Sequential Deconfounder uses a novel Gaussian process latent variable model to infer substitutes for the unobserved confounders, which are then used in conjunction with an outcome model to estimate treatment effects over time. We prove that using our method yields unbiased estimates of individualized treatment responses over time. Using simulated and real medical data, we demonstrate the efficacy of our method in deconfounding the estimation of treatment responses over time.  ( 2 min )
    Finding Valid Adjustments under Non-ignorability with Minimal DAG Knowledge. (arXiv:2106.11560v3 [cs.LG] UPDATED)
    Treatment effect estimation from observational data is a fundamental problem in causal inference. There are two very different schools of thought that have tackled this problem. On one hand, Pearlian framework commonly assumes structural knowledge (provided by an expert) in form of directed acyclic graphs and provides graphical criteria such as back-door criterion to identify valid adjustment sets. On other hand, potential outcomes (PO) framework commonly assumes that all observed features satisfy ignorability (i.e., no hidden confounding), which in general is untestable. In prior works that attempted to bridge these frameworks, there is an observational criteria to identify an anchor variable and if a subset of covariates (not involving the anchor variable) passes a suitable conditional independence criteria, then that subset is a valid back-door. Our main result strengthens these prior results by showing that under a different expert-driven structural knowledge -- that one variable is a direct causal parent of treatment variable -- remarkably, testing for subsets (not involving the known parent variable) that are valid back-doors is equivalent to an invariance test. Importantly, we also cover the non-trivial case where entire set of observed features is not ignorable (generalizing the PO framework) without requiring knowledge of all parents of treatment variable. Our key technical idea involves generation of a synthetic sub-sampling (or environment) variable that is a function of the known parent variable. In addition to designing an invariance test, this sub-sampling variable allows us to leverage Invariant Risk Minimization, and thus, connects finding valid adjustments (in non-ignorable observational setting) to representation learning. We demonstrate effectiveness and tradeoffs of our approaches on a variety of synthetic data as well as real causal effect estimation benchmarks.  ( 2 min )
    Texture Characterization of Histopathologic Images Using Ecological Diversity Measures and Discrete Wavelet Transform. (arXiv:2202.13270v1 [cs.CV])
    Breast cancer is a health problem that affects mainly the female population. An early detection increases the chances of effective treatment, improving the prognosis of the disease. In this regard, computational tools have been proposed to assist the specialist in interpreting the breast digital image exam, providing features for detecting and diagnosing tumors and cancerous cells. Nonetheless, detecting tumors with a high sensitivity rate and reducing the false positives rate is still challenging. Texture descriptors have been quite popular in medical image analysis, particularly in histopathologic images (HI), due to the variability of both the texture found in such images and the tissue appearance due to irregularity in the staining process. Such variability may exist depending on differences in staining protocol such as fixation, inconsistency in the staining condition, and reagents, either between laboratories or in the same laboratory. Textural feature extraction for quantifying HI information in a discriminant way is challenging given the distribution of intrinsic properties of such images forms a non-deterministic complex system. This paper proposes a method for characterizing texture across HIs with a considerable success rate. By employing ecological diversity measures and discrete wavelet transform, it is possible to quantify the intrinsic properties of such images with promising accuracy on two HI datasets compared with state-of-the-art methods.  ( 2 min )
    Non-Transferable Learning: A New Approach for Model Ownership Verification and Applicability Authorization. (arXiv:2106.06916v2 [cs.LG] UPDATED)
    As Artificial Intelligence as a Service gains popularity, protecting well-trained models as intellectual property is becoming increasingly important. There are two common types of protection methods: ownership verification and usage authorization. In this paper, we propose Non-Transferable Learning (NTL), a novel approach that captures the exclusive data representation in the learned model and restricts the model generalization ability to certain domains. This approach provides effective solutions to both model verification and authorization. Specifically: 1) For ownership verification, watermarking techniques are commonly used but are often vulnerable to sophisticated watermark removal methods. By comparison, our NTL-based ownership verification provides robust resistance to state-of-the-art watermark removal methods, as shown in extensive experiments with 6 removal approaches over the digits, CIFAR10 & STL10, and VisDA datasets. 2) For usage authorization, prior solutions focus on authorizing specific users to access the model, but authorized users can still apply the model to any data without restriction. Our NTL-based authorization approach instead provides data-centric protection, which we call applicability authorization, by significantly degrading the performance of the model on unauthorized data. Its effectiveness is also shown through experiments on the aforementioned datasets.  ( 2 min )
    pix2rule: End-to-end Neuro-symbolic Rule Learning. (arXiv:2106.07487v3 [cs.LG] UPDATED)
    Humans have the ability to seamlessly combine low-level visual input with high-level symbolic reasoning often in the form of recognising objects, learning relations between them and applying rules. Neuro-symbolic systems aim to bring a unifying approach to connectionist and logic-based principles for visual processing and abstract reasoning respectively. This paper presents a complete neuro-symbolic method for processing images into objects, learning relations and logical rules in an end-to-end fashion. The main contribution is a differentiable layer in a deep learning architecture from which symbolic relations and rules can be extracted by pruning and thresholding. We evaluate our model using two datasets: subgraph isomorphism task for symbolic rule learning and an image classification domain with compound relations for learning objects, relations and rules. We demonstrate that our model scales beyond state-of-the-art symbolic learners and outperforms deep relational neural network architectures.  ( 2 min )
    Nonperturbative renormalization for the neural network-QFT correspondence. (arXiv:2108.01403v2 [hep-th] UPDATED)
    In a recent work arXiv:2008.08601, Halverson, Maiti and Stoner proposed a description of neural networks in terms of a Wilsonian effective field theory. The infinite-width limit is mapped to a free field theory, while finite $N$ corrections are taken into account by interactions (non-Gaussian terms in the action). In this paper, we study two related aspects of this correspondence. First, we comment on the concepts of locality and power-counting in this context. Indeed, these usual space-time notions may not hold for neural networks (since inputs can be arbitrary), however, the renormalization group provides natural notions of locality and scaling. Moreover, we comment on several subtleties, for example, that data components may not have a permutation symmetry: in that case, we argue that random tensor field theories could provide a natural generalization. Second, we improve the perturbative Wilsonian renormalization from arXiv:2008.08601 by providing an analysis in terms of the nonperturbative renormalization group using the Wetterich-Morris equation. An important difference with usual nonperturbative RG analysis is that only the effective (IR) 2-point function is known, which requires setting the problem with care. Our aim is to provide a useful formalism to investigate neural networks behavior beyond the large-width limit (i.e.~far from Gaussian limit) in a nonperturbative fashion. A major result of our analysis is that changing the standard deviation of the neural network weight distribution can be interpreted as a renormalization flow in the space of networks. We focus on translations invariant kernels and provide preliminary numerical results.  ( 2 min )
    Construction Cost Index Forecasting: A Multi-feature Fusion Approach. (arXiv:2108.10155v5 [cs.LG] UPDATED)
    The construction cost index is an important indicator of the construction industry. Predicting CCI has important practical significance. This paper combines information fusion with machine learning, and proposes a multi-feature fusion (MFF) module for time series forecasting. The main contribution of MFF is to improve the prediction accuracy of CCI, and propose a feature fusion framework for time series. Compared with the convolution module, the MFF module is a module that extracts certain features. Experiments have proved that the combination of MFF module and multi-layer perceptron has a relatively good prediction effect. The MFF neural network model has high prediction accuracy and prediction efficiency, which is a study of continuous attention.  ( 2 min )
    Ray-based classification framework for high-dimensional data. (arXiv:2010.00500v2 [cs.LG] UPDATED)
    While classification of arbitrary structures in high dimensions may require complete quantitative information, for simple geometrical structures, low-dimensional qualitative information about the boundaries defining the structures can suffice. Rather than using dense, multi-dimensional data, we propose a deep neural network (DNN) classification framework that utilizes a minimal collection of one-dimensional representations, called \emph{rays}, to construct the "fingerprint" of the structure(s) based on substantially reduced information. We empirically study this framework using a synthetic dataset of double and triple quantum dot devices and apply it to the classification problem of identifying the device state. We show that the performance of the ray-based classifier is already on par with traditional 2D images for low dimensional systems, while significantly cutting down the data acquisition cost.  ( 2 min )
    Challenges in Migrating Imperative Deep Learning Programs to Graph Execution: An Empirical Study. (arXiv:2201.09953v2 [cs.SE] UPDATED)
    Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code that supports symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development tends to produce DL code that is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, less error-prone imperative DL frameworks encouraging eager execution have emerged but at the expense of run-time performance. While hybrid approaches aim for the "best of both worlds," the challenges in applying them in the real world are largely unknown. We conduct a data-driven analysis of challenges -- and resultant bugs -- involved in writing reliable yet performant imperative DL code by studying 250 open-source projects, consisting of 19.7 MLOC, along with 470 and 446 manually examined code patches and bug reports, respectively. The results indicate that hybridization: (i) is prone to API misuse, (ii) can result in performance degradation -- the opposite of its intention, and (iii) has limited application due to execution mode incompatibility. We put forth several recommendations, best practices, and anti-patterns for effectively hybridizing imperative DL code, potentially benefiting DL practitioners, API designers, tool developers, and educators.  ( 2 min )
    Thinking Outside the Ball: Optimal Learning with Gradient Descent for Generalized Linear Stochastic Convex Optimization. (arXiv:2202.13328v1 [cs.LG])
    We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e.~where each instantaneous loss is a scalar convex function of a linear function. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most $\epsilon$ (compared to the best possible with unit Euclidean norm) with an optimal, up to logarithmic factors, sample complexity of $\tilde{O}(1/\epsilon^2)$ and only $\tilde{O}(1/\epsilon^2)$ iterations. This contrasts with general stochastic convex optimization, where $\Omega(1/\epsilon^4)$ iterations are needed Amir et al. [2021b]. The lower iteration complexity is ensured by leveraging uniform convergence rather than stability. But instead of uniform convergence in a norm ball, which we show can guarantee suboptimal learning using $\Theta(1/\epsilon^4)$ samples, we rely on uniform convergence in a distribution-dependent ball.  ( 2 min )
    Conditional Frechet Inception Distance. (arXiv:2103.11521v2 [cs.LG] UPDATED)
    We consider distance functions between conditional distributions. We focus on the Wasserstein metric and its Gaussian case known as the Frechet Inception Distance (FID). We develop conditional versions of these metrics, analyze their relations and provide a closed form solution to the conditional FID (CFID) metric. We numerically compare the metrics in the context of performance evaluation of modern conditional generative models. Our results show the advantages of CFID compared to the classical FID and mean squared error (MSE) measures. In contrast to FID, CFID is useful in identifying failures where realistic outputs which are not related to their inputs are generated. On the other hand, compared to MSE, CFID is useful in identifying failures where a single realistic output is generated even though there is a diverse set of equally probable outputs.  ( 2 min )
    Model-free Policy Learning with Reward Gradients. (arXiv:2103.05147v3 [cs.LG] UPDATED)
    Despite the increasing popularity of policy gradient methods, they are yet to be widely utilized in sample-scarce applications, such as robotics. The sample efficiency could be improved by making best usage of available information. As a key component in reinforcement learning, the reward function is usually devised carefully to guide the agent. Hence, the reward function is usually known, allowing access to not only scalar reward signals but also reward gradients. To benefit from reward gradients, previous works require the knowledge of environment dynamics, which are hard to obtain. In this work, we develop the \textit{Reward Policy Gradient} estimator, a novel approach that integrates reward gradients without learning a model. Bypassing the model dynamics allows our estimator to achieve a better bias-variance trade-off, which results in a higher sample efficiency, as shown in the empirical analysis. Our method also boosts the performance of Proximal Policy Optimization on different MuJoCo control tasks.  ( 2 min )
    A Unified Wasserstein Distributional Robustness Framework for Adversarial Training. (arXiv:2202.13437v1 [cs.LG])
    It is well-known that deep neural networks (DNNs) are susceptible to adversarial attacks, exposing a severe fragility of deep learning systems. As the result, adversarial training (AT) method, by incorporating adversarial examples during training, represents a natural and effective approach to strengthen the robustness of a DNN-based classifier. However, most AT-based methods, notably PGD-AT and TRADES, typically seek a pointwise adversary that generates the worst-case adversarial example by independently perturbing each data sample, as a way to "probe" the vulnerability of the classifier. Arguably, there are unexplored benefits in considering such adversarial effects from an entire distribution. To this end, this paper presents a unified framework that connects Wasserstein distributional robustness with current state-of-the-art AT methods. We introduce a new Wasserstein cost function and a new series of risk functions, with which we show that standard AT methods are special cases of their counterparts in our framework. This connection leads to an intuitive relaxation and generalization of existing AT methods and facilitates the development of a new family of distributional robustness AT-based algorithms. Extensive experiments show that our distributional robustness AT algorithms robustify further their standard AT counterparts in various settings.  ( 2 min )
    Multi-Objective Model-based Reinforcement Learning for Infectious Disease Control. (arXiv:2009.04607v3 [cs.LG] UPDATED)
    Severe infectious diseases such as the novel coronavirus (COVID-19) pose a huge threat to public health. Stringent control measures, such as school closures and stay-at-home orders, while having significant effects, also bring huge economic losses. In the face of an emerging infectious disease, a crucial question for policymakers is how to make the trade-off and implement the appropriate interventions timely given the huge uncertainty. In this work, we propose a Multi-Objective Model-based Reinforcement Learning framework to facilitate data-driven decision-making and minimize the overall long-term cost. Specifically, at each decision point, a Bayesian epidemiological model is first learned as the environment model, and then the proposed model-based multi-objective planning algorithm is applied to find a set of Pareto-optimal policies. This framework, combined with the prediction bands for each policy, provides a real-time decision support tool for policymakers. The application is demonstrated with the spread of COVID-19 in China.  ( 2 min )
    Generalized Label Shift Correction via Minimum Uncertainty Principle: Theory and Algorithm. (arXiv:2202.13043v1 [cs.LG])
    As a fundamental problem in machine learning, dataset shift induces a paradigm to learn and transfer knowledge under changing environment. Previous methods assume the changes are induced by covariate, which is less practical for complex real-world data. We consider the Generalized Label Shift (GLS), which provides an interpretable insight into the learning and transfer of desirable knowledge. Current GLS methods: 1) are not well-connected with the statistical learning theory; 2) usually assume the shifting conditional distributions will be matched with an implicit transformation, but its explicit modeling is unexplored. In this paper, we propose a conditional adaptation framework to deal with these challenges. From the perspective of learning theory, we prove that the generalization error of conditional adaptation is lower than previous covariate adaptation. Following the theoretical results, we propose the minimum uncertainty principle to learn conditional invariant transformation via discrepancy optimization. Specifically, we propose the \textit{conditional metric operator} on Hilbert space to characterize the distinctness of conditional distributions. For finite observations, we prove that the empirical estimation is always well-defined and will converge to underlying truth as sample size increases. The results of extensive experiments demonstrate that the proposed model achieves competitive performance under different GLS scenarios.  ( 2 min )
    Winter wheat yield prediction using convolutional neural networks from environmental and phenological data. (arXiv:2105.01282v2 [cs.LG] UPDATED)
    Crop yield forecasting depends on many interactive factors, including crop genotype, weather, soil, and management practices. This study analyzes the performance of machine learning and deep learning methods for winter wheat yield prediction using an extensive dataset of weather, soil, and crop phenology variables in 271 counties across Germany from 1999 to 2019. We proposed a Convolutional Neural Network (CNN) model, which uses a 1-dimensional convolution operation to capture the time dependencies of environmental variables. We used eight supervised machine learning models as baselines and evaluated their predictive performance using RMSE, MAE, and correlation coefficient metrics to benchmark the yield prediction results. Our findings suggested that nonlinear models such as the proposed CNN, Deep Neural Network (DNN), and XGBoost were more effective in understanding the relationship between the crop yield and input data compared to the linear models. Our proposed CNN model outperformed all other baseline models used for winter wheat yield prediction (7 to 14% lower RMSE, 3 to 15% lower MAE, and 4 to 50% higher correlation coefficient than the best performing baseline across test data). We aggregated soil moisture and meteorological features at the weekly resolution to address the seasonality of the data. We also moved beyond prediction and interpreted the outputs of our proposed CNN model using SHAP and force plots which provided key insights in explaining the yield prediction results (importance of variables by time). We found DUL, wind speed at week ten, and radiation amount at week seven as the most critical features in winter wheat yield prediction.  ( 2 min )
    Learning a Single Neuron with Gradient Methods. (arXiv:2001.05205v3 [cs.LG] UPDATED)
    We consider the fundamental problem of learning a single neuron $x \mapsto\sigma(w^\top x)$ using standard gradient methods. As opposed to previous works, which considered specific (and not always realistic) input distributions and activation functions $\sigma(\cdot)$, we ask whether a more general result is attainable, under milder assumptions. On the one hand, we show that some assumptions on the distribution and the activation function are necessary. On the other hand, we prove positive guarantees under mild assumptions, which go beyond those studied in the literature so far. We also point out and study the challenges in further strengthening and generalizing our results.  ( 2 min )
    PUDLE: Implicit Acceleration of Dictionary Learning by Backpropagation. (arXiv:2106.00058v3 [cs.LG] UPDATED)
    The dictionary learning problem, representing data as a combination of a few atoms, has long stood as a popular method for learning representations in statistics and signal processing. The most popular dictionary learning algorithm alternates between sparse coding and dictionary update steps, and rich literature has studied its theoretical convergence. The growing popularity of neurally plausible unfolded sparse coding networks has led to the empirical finding that backpropagation through such networks performs dictionary learning. This paper offers the theoretical proof for these empirical results through PUDLE, a Provable Unfolded Dictionary LEarning method. We provide sufficient conditions on the network initialization and data distribution for model recovery. We highlight the impact of loss, unfolding, and backpropagation on convergence. We discuss proper initializations and strategies to resolve the instability issue concerning backpropagated gradient. This is achieved through tuning network parameters and modification of loss function. We complement our findings through synthetic and image denoising experiments. Finally, we highlight the interpretability of PULDE by deriving a mathematical relation between network weights, its output, and the training data.  ( 2 min )
    Focus! Rating XAI Methods and Finding Biases. (arXiv:2109.15035v3 [cs.LG] UPDATED)
    AI explainability improves the transparency of models, making them more trustworthy. Such goals are motivated by the emergence of deep learning models, which are obscure by nature; even in the domain of images, where deep learning has succeeded the most, explainability is still poorly assessed. In the field of image recognition many feature attribution methods have been proposed with the purpose of explaining a model's behavior using visual cues. However, no metrics have been established so far to assess and select these methods objectively. In this paper we propose a consistent evaluation score for feature attribution methods -- the Focus -- designed to quantify their coherency to the task. While most previous work adds out-of-distribution noise to samples, we introduce a methodology to add noise from within the distribution. This is done through mosaics of instances from different classes, and the explanations these generate. On those, we compute a visual pseudo-precision metric, Focus. First, we show the robustness of the approach through a set of randomization experiments. Then we use Focus to compare six popular explainability techniques across several CNN architectures and classification datasets. Our results find some methods to be consistently reliable (LRP, GradCAM), while others produce class-agnostic explanations (SmoothGrad, IG). Finally we introduce another application of Focus, using it for the identification and characterization of biases found in models. This empowers bias-management tools, in another small step towards trustworthy AI.  ( 2 min )
    Spatial-temporal Conv-sequence Learning with Accident Encoding for Traffic Flow Prediction. (arXiv:2105.10478v3 [cs.LG] UPDATED)
    In an intelligent transportation system, the key problem of traffic forecasting is how to extract periodic temporal dependencies and complex spatial correlations. Current state-of-the-art methods for predicting traffic flow are based on graph architectures and sequence learning models, but they do not fully exploit dynamic spatial-temporal information in the traffic system. Specifically, the temporal dependencies in the short-range are diluted by recurrent neural networks. Moreover, local spatial information is also ignored by existing sequence models, because their convolution operation uses global average pooling. Besides, accidents may occur during object transition, which will cause congestion in the real world and further decrease prediction accuracy. To overcome these challenges, we propose Spatial-Temporal Conv-sequence Learning (STCL), where a focused temporal block uses unidirectional convolution to capture short-term periodic temporal dependencies effectively, and a patial-temporal fusion module is responsible for extracting dependencies of interactions and decreasing the feature dimensions. Moreover, as the accidents features have an impact on local traffic congestion, we employ position encoding to detect anomalies in complex traffic situations. We have conducted a large number of experiments on real-world tasks and verified the effectiveness of our proposed method.  ( 2 min )
    Quantum Boosting using Domain-Partitioning Hypotheses. (arXiv:2110.12793v2 [quant-ph] UPDATED)
    Boosting is an ensemble learning method that converts a weak learner into a strong learner in the PAC learning framework. Freund and Schapire gave the first classical boosting algorithm for binary hypothesis known as AdaBoost, and this was recently adapted into a quantum boosting algorithm by Arunachalam et al. Their quantum boosting algorithm (which we refer to as Q-AdaBoost) is quadratically faster than the classical version in terms of the VC-dimension of the hypothesis class of the weak learner but polynomially worse in the bias of the weak learner. In this work we design a different quantum boosting algorithm that uses domain partitioning hypotheses that are significantly more flexible than those used in prior quantum boosting algorithms in terms of margin calculations. Our algorithm Q-RealBoost is inspired by the "Real AdaBoost" (aka. RealBoost) extension to the original AdaBoost algorithm. Further, we show that Q-RealBoost provides a polynomial speedup over Q-AdaBoost in terms of both the bias of the weak learner and the time taken by the weak learner to learn the target concept class.  ( 2 min )
    RobustQNN: Noise-Aware Training for Robust Quantum Neural Networks. (arXiv:2110.11331v2 [cs.LG] UPDATED)
    Quantum Neural Network (QNN) is a promising application towards quantum advantage on near-term quantum hardware. However, due to the large quantum noises (errors), the performance of QNN models has a severe degradation on real quantum devices. For example, the accuracy gap between noise-free simulation and noisy results on IBMQ-Yorktown for MNIST-4 classification is over 60%. Existing noise mitigation methods are general ones without leveraging unique characteristics of QNN and are only applicable to inference; on the other hand, existing QNN work does not consider noise effect. To this end, we present RoQNN, a QNN-specific framework to perform noise-aware optimizations in both training and inference stages to improve robustness. We analytically deduct and experimentally observe that the effect of quantum noise to QNN measurement outcome is a linear map from noise-free outcome with a scaling and a shift factor. Motivated by that, we propose post-measurement normalization to mitigate the feature distribution differences between noise-free and noisy scenarios. Furthermore, to improve the robustness against noise, we propose noise injection to the training process by inserting quantum error gates to QNN according to realistic noise models of quantum hardware. Finally, post-measurement quantization is introduced to quantize the measurement outcomes to discrete values, achieving the denoising effect. Extensive experiments on 8 classification tasks using 6 quantum devices demonstrate that RoQNN improves accuracy by up to 43%, and achieves over 94% 2-class, 80% 4-class, and 34% 10-class MNIST classification accuracy measured on real quantum computers. We also open-source our PyTorch library for construction and noise-aware training of QNN at https://github.com/mit-han-lab/torchquantum .  ( 3 min )
    MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics. (arXiv:2109.00110v2 [cs.AI] UPDATED)
    We present miniF2F, a dataset of formal Olympiad-level mathematics problems statements intended to provide a unified cross-system benchmark for neural theorem proving. The miniF2F benchmark currently targets Metamath, Lean, Isabelle (partially) and HOL Light (partially) and consists of 488 problem statements drawn from the AIME, AMC, and the International Mathematical Olympiad (IMO), as well as material from high-school and undergraduate mathematics courses. We report baseline results using GPT-f, a neural theorem prover based on GPT-3 and provide an analysis of its performance. We intend for miniF2F to be a community-driven effort and hope that our benchmark will help spur advances in neural theorem proving.  ( 2 min )
    An Empirical Study on the Intrinsic Privacy of SGD. (arXiv:1912.02919v4 [cs.LG] UPDATED)
    Introducing noise in the training of machine learning systems is a powerful way to protect individual privacy via differential privacy guarantees, but comes at a cost to utility. This work looks at whether the inherent randomness of stochastic gradient descent (SGD) could contribute to privacy, effectively reducing the amount of \emph{additional} noise required to achieve a given privacy guarantee. We conduct a large-scale empirical study to examine this question. Training a grid of over 120,000 models across four datasets (tabular and images) on convex and non-convex objectives, we demonstrate that the random seed has a larger impact on model weights than any individual training example. We test the distribution over weights induced by the seed, finding that the simple convex case can be modelled with a multivariate Gaussian posterior, while neural networks exhibit multi-modal and non-Gaussian weight distributions. By casting convex SGD as a Gaussian mechanism, we then estimate an `intrinsic' data-dependent $\epsilon_i(\mathcal{D})$, finding values as low as 6.3, dropping to 1.9 using empirical estimates. We use a membership inference attack to estimate $\epsilon$ for non-convex SGD and demonstrate that hiding the random seed from the adversary results in a statistically significant reduction in attack performance, corresponding to a reduction in the effective $\epsilon$. These results provide empirical evidence that SGD exhibits appreciable variability relative to its dataset sensitivity, and this `intrinsic noise' has the potential to be leveraged to improve the utility of privacy-preserving machine learning.  ( 2 min )
    Variance Minimization in the Wasserstein Space for Invariant Causal Prediction. (arXiv:2110.07064v2 [cs.LG] UPDATED)
    Selecting powerful predictors for an outcome is a cornerstone task for machine learning. However, some types of questions can only be answered by identifying the predictors that causally affect the outcome. A recent approach to this causal inference problem leverages the invariance property of a causal mechanism across differing experimental environments (Peters et al., 2016; Heinze-Deml et al., 2018). This method, invariant causal prediction (ICP), has a substantial computational defect -- the runtime scales exponentially with the number of possible causal variables. In this work, we show that the approach taken in ICP may be reformulated as a series of nonparametric tests that scales linearly in the number of predictors. Each of these tests relies on the minimization of a novel loss function -- the Wasserstein variance -- that is derived from tools in optimal transport theory and is used to quantify distributional variability across environments. We prove under mild assumptions that our method is able to recover the set of identifiable direct causes, and we demonstrate in our experiments that it is competitive with other benchmark causal discovery algorithms.  ( 2 min )
    A State-of-the-art Survey of Artificial Neural Networks for Whole-slide Image Analysis:from Popular Convolutional Neural Networks to Potential Visual Transformers. (arXiv:2104.06243v3 [eess.IV] UPDATED)
    To increase the objectivity and accuracy of pathologists' work, artificial neural network(ANN) methods have been generally needed in the segmentation, classification, and detection of histopathological WSI. In this paper, WSI analysis methods based on ANN are reviewed. Firstly, the development status of WSI and ANN methods is introduced. Secondly, we summarize the common ANN methods. Next, we discuss publicly available WSI datasets and evaluation metrics. These ANN architectures for WSI processing are divided into classical neural networks and deep neural networks(DNNs) and then analyzed. Finally, the application prospect of the analytical method in this field is discussed. The important potential method is Visual Transformers.  ( 2 min )
    Learning Continuous Exponential Families Beyond Gaussian. (arXiv:2102.09198v2 [cs.LG] UPDATED)
    We address the problem of learning of continuous exponential family distributions with unbounded support. While a lot of progress has been made on learning of Gaussian graphical models, we still lack scalable algorithms for reconstructing general continuous exponential families modeling higher-order moments of the data beyond the mean and the covariance. Here, we introduce a computationally efficient method for learning continuous graphical models based on the Interaction Screening approach. Through a series of numerical experiments, we show that our estimator maintains similar requirements in terms of accuracy and sample complexity scalings compared to alternative approaches such as maximization of conditional likelihood, while considerably improving upon the algorithm's run-time.  ( 2 min )
    Homeomorphic-Invariance of EM: Non-Asymptotic Convergence in KL Divergence for Exponential Families via Mirror Descent. (arXiv:2011.01170v3 [cs.LG] UPDATED)
    Expectation maximization (EM) is the default algorithm for fitting probabilistic models with missing or latent variables, yet we lack a full understanding of its non-asymptotic convergence properties. Previous works show results along the lines of "EM converges at least as fast as gradient descent" by assuming the conditions for the convergence of gradient descent apply to EM. This approach is not only loose, in that it does not capture that EM can make more progress than a gradient step, but the assumptions fail to hold for textbook examples of EM like Gaussian mixtures. In this work we first show that for the common setting of exponential family distributions, viewing EM as a mirror descent algorithm leads to convergence rates in Kullback-Leibler (KL) divergence. Then, we show how the KL divergence is related to first-order stationarity via Bregman divergences. In contrast to previous works, the analysis is invariant to the choice of parametrization and holds with minimal assumptions. We also show applications of these ideas to local linear (and superlinear) convergence rates, generalized EM, and non-exponential family distributions.  ( 2 min )
    An End-to-End Computer Vision Methodology for Quantitative Metallography. (arXiv:2104.11159v2 [cond-mat.mtrl-sci] UPDATED)
    Metallography is crucial for a proper assessment of material's properties. It involves mainly the investigation of spatial distribution of grains and the occurrence and characteristics of inclusions or precipitates. This work presents an holistic artificial intelligence model for Anomaly Detection that automatically quantifies the degree of anomaly of impurities in alloys. We suggest the following examination process: (1) Deep semantic segmentation is performed on the inclusions (based on a suitable metallographic database of alloys and corresponding tags of inclusions), producing inclusions masks that are saved into a separated database. (2) Deep image inpainting is performed to fill the removed inclusions parts, resulting in 'clean' metallographic images, which contain the background of grains. (3) Grains' boundaries are marked using deep semantic segmentation (based on another metallographic database of alloys), producing boundaries that are ready for further inspection on the distribution of grains' size. (4) Deep anomaly detection and pattern recognition is performed on the inclusions masks to determine spatial, shape and area anomaly detection of the inclusions. Finally, the system recommends to an expert on areas of interests for further examination. The performance of the model is presented and analyzed based on few representative cases. Although the models presented here were developed for metallography analysis, most of them can be generalized to a wider set of problems in which anomaly detection of geometrical objects is desired. All models as well as the data-sets that were created for this work, are publicly available at https://github.com/Scientific-Computing-Lab-NRCN/MLography.  ( 2 min )
    Neural Contextual Bandits without Regret. (arXiv:2107.03144v2 [stat.ML] UPDATED)
    Contextual bandits are a rich model for sequential decision making given side information, with important applications, e.g., in recommender systems. We propose novel algorithms for contextual bandits harnessing neural networks to approximate the unknown reward function. We resolve the open problem of proving sublinear regret bounds in this setting for general context sequences, considering both fully-connected and convolutional networks. To this end, we first analyze NTK-UCB, a kernelized bandit optimization algorithm employing the Neural Tangent Kernel (NTK), and bound its regret in terms of the NTK maximum information gain $\gamma_T$, a complexity parameter capturing the difficulty of learning. Our bounds on $\gamma_T$ for the NTK may be of independent interest. We then introduce our neural network based algorithm NN-UCB, and show that its regret closely tracks that of NTK-UCB. Under broad non-parametric assumptions about the reward function, our approach converges to the optimal policy at a $\tilde{\mathcal{O}}(T^{-1/2d})$ rate, where $d$ is the dimension of the context.  ( 2 min )
    Model Fusion of Heterogeneous Neural Networks via Cross-Layer Alignment. (arXiv:2110.15538v2 [cs.LG] UPDATED)
    Layer-wise model fusion via optimal transport, named OTFusion, applies soft neuron association for unifying different pre-trained networks to save computational resources. While enjoying its success, OTFusion requires the input networks to have the same number of layers. To address this issue, we propose a novel model fusion framework, named CLAFusion, to fuse neural networks with a different number of layers, which we refer to as heterogeneous neural networks, via cross-layer alignment. The cross-layer alignment problem, which is an unbalanced assignment problem, can be solved efficiently using dynamic programming. Based on the cross-layer alignment, our framework balances the number of layers of neural networks before applying layer-wise model fusion. Our experiments indicate that CLAFusion, with an extra finetuning process, improves the accuracy of residual networks on CIFAR10 and CIFAR100 datasets. Furthermore, we explore its practical usage for model compression and knowledge distillation when applying to the teacher-student setting.  ( 2 min )
    NAAQA: A Neural Architecture for Acoustic Question Answering. (arXiv:2106.06147v2 [cs.CL] UPDATED)
    The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs. These include handling of variable duration scenes, and scenes built with elementary sounds that differ between training and test set. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. The use of 1D convolutions in time and frequency to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. We show that time coordinate maps augment temporal localization capabilities which enhance performance of the network by ~17 percentage points. On the other hand, frequency coordinate maps have little influence on this task. NAAQA achieves 79.5% of accuracy on the AQA task with ~4 times fewer parameters than the previously explored VQA model. We evaluate the perfomance of NAAQA on an independent data set reconstructed from DAQA. We also test the addition of a MALiMo module in our model on both CLEAR2 and DAQA. We provide a detailed analysis of the results for the different question types. We release the code to produce CLEAR2 as well as NAAQA to foster research in this newly emerging machine learning task.  ( 2 min )
    Optimal Compression of Locally Differentially Private Mechanisms. (arXiv:2111.00092v2 [cs.CR] UPDATED)
    Compressing the output of \epsilon-locally differentially private (LDP) randomizers naively leads to suboptimal utility. In this work, we demonstrate the benefits of using schemes that jointly compress and privatize the data using shared randomness. In particular, we investigate a family of schemes based on Minimal Random Coding (Havasi et al., 2019) and prove that they offer optimal privacy-accuracy-communication tradeoffs. Our theoretical and empirical findings show that our approach can compress PrivUnit (Bhowmick et al., 2018) and Subset Selection (Ye et al., 2018), the best known LDP algorithms for mean and frequency estimation, to to the order of \epsilon-bits of communication while preserving their privacy and accuracy guarantees.  ( 2 min )
    Data Overlap: A Prerequisite For Disentanglement. (arXiv:2202.13341v1 [cs.LG])
    Learning disentangled representations with variational autoencoders (VAEs) is often attributed to the regularisation component of the loss. In this work, we highlight the interaction between data and the reconstruction term of the loss as the main contributor to disentanglement in VAEs. We note that standardised benchmark datasets are constructed in a way that is conducive to learning what appear to be disentangled representations. We design an intuitive adversarial dataset that exploits this mechanism to break existing state-of-the-art disentanglement frameworks. Finally, we provide solutions in the form of a modified reconstruction loss suggesting that VAEs are accidental distance learners.  ( 2 min )
    Approximate Nearest Neighbor Search under Neural Similarity Metric for Large-Scale Recommendation. (arXiv:2202.10226v2 [cs.IR] UPDATED)
    Model-based methods for recommender systems have been studied extensively for years. Modern recommender systems usually resort to 1) representation learning models which define user-item preference as the distance between their embedding representations, and 2) embedding-based Approximate Nearest Neighbor (ANN) search to tackle the efficiency problem introduced by large-scale corpus. While providing efficient retrieval, the embedding-based retrieval pattern also limits the model capacity since the form of user-item preference measure is restricted to the distance between their embedding representations. However, for other more precise user-item preference measures, e.g., preference scores directly derived from a deep neural network, they are computationally intractable because of the lack of an efficient retrieval method, and an exhaustive search for all user-item pairs is impractical. In this paper, we propose a novel method to extend ANN search to arbitrary matching functions, e.g., a deep neural network. Our main idea is to perform a greedy walk with a matching function in a similarity graph constructed from all items. To solve the problem that the similarity measures of graph construction and user-item matching function are heterogeneous, we propose a pluggable adversarial training task to ensure the graph search with arbitrary matching function can achieve fairly high precision. Experimental results in both open source and industry datasets demonstrate the effectiveness of our method. The proposed method has been fully deployed in the Taobao display advertising platform and brings a considerable advertising revenue increase. We also summarize our detailed experiences in deployment in this paper.  ( 2 min )
    Learning the Beauty in Songs: Neural Singing Voice Beautifier. (arXiv:2202.13277v1 [eess.AS])
    We are interested in a novel task, singing voice beautifying (SVB). Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre. Current automatic pitch correction techniques are immature, and most of them are restricted to intonation but ignore the overall aesthetic quality. Hence, we introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task, which adopts a conditional variational autoencoder as the backbone and learns the latent representations of vocal tone. In NSVB, we propose a novel time-warping approach for pitch correction: Shape-Aware Dynamic Time Warping (SADTW), which ameliorates the robustness of existing time-warping approaches, to synchronize the amateur recording with the template pitch curve. Furthermore, we propose a latent-mapping algorithm in the latent space to convert the amateur vocal tone to the professional one. To achieve this, we also propose a new dataset containing parallel singing recordings of both amateur and professional versions. Extensive experiments on both Chinese and English songs demonstrate the effectiveness of our methods in terms of both objective and subjective metrics. Audio samples are available at~\url{https://neuralsvb.github.io}.  ( 2 min )
    Taming the Long Tail of Deep Probabilistic Forecasting. (arXiv:2202.13418v1 [cs.LG])
    Deep probabilistic forecasting is gaining attention in numerous applications ranging from weather prognosis, through electricity consumption estimation, to autonomous vehicle trajectory prediction. However, existing approaches focus on improvements on the most common scenarios without addressing the performance on rare and difficult cases. In this work, we identify a long tail behavior in the performance of state-of-the-art deep learning methods on probabilistic forecasting. We present two moment-based tailedness measurement concepts to improve performance on the difficult tail examples: Pareto Loss and Kurtosis Loss. Kurtosis loss is a symmetric measurement as the fourth moment about the mean of the loss distribution. Pareto loss is asymmetric measuring right tailedness, modeling the loss using a generalized Pareto distribution (GPD). We demonstrate the performance of our approach on several real-world datasets including time series and spatiotemporal trajectories, achieving significant improvements on the tail examples.  ( 2 min )
    Capturing Actionable Dynamics with Structured Latent Ordinary Differential Equations. (arXiv:2202.12932v1 [stat.ML])
    End-to-end learning of dynamical systems with black-box models, such as neural ordinary differential equations (ODEs), provides a flexible framework for learning dynamics from data without prescribing a mathematical model for the dynamics. Unfortunately, this flexibility comes at the cost of understanding the dynamical system, for which ODEs are used ubiquitously. Further, experimental data are collected under various conditions (inputs), such as treatments, or grouped in some way, such as part of sub-populations. Understanding the effects of these system inputs on system outputs is crucial to have any meaningful model of a dynamical system. To that end, we propose a structured latent ODE model that explicitly captures system input variations within its latent representation. Building on a static latent variable specification, our model learns (independent) stochastic factors of variation for each input to the system, thus separating the effects of the system inputs in the latent space. This approach provides actionable modeling through the controlled generation of time-series data for novel input combinations (or perturbations). Additionally, we propose a flexible approach for quantifying uncertainties, leveraging a quantile regression formulation. Experimental results on challenging biological datasets show consistent improvements over competitive baselines in the controlled generation of observational data and prediction of biologically meaningful system inputs.  ( 2 min )
    Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization. (arXiv:2202.13212v1 [cs.LG])
    We propose a stochastic conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms. Existing CGM variants for this template either suffer from slow convergence rates, or require carefully increasing the batch size over the course of the algorithm's execution, which leads to computing full gradients. In contrast, the proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques. In applications we put special emphasis on problems with a large number of separable constraints. Such problems are prevalent among semidefinite programming (SDP) formulations arising in machine learning and theoretical computer science. We provide numerical experiments on matrix completion, unsupervised clustering, and sparsest-cut SDPs.  ( 2 min )
    The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. (arXiv:2110.07732v2 [cs.LG] UPDATED)
    Despite progress across a broad range of applications, Transformers have limited success in systematic generalization. The situation is especially frustrating in the case of algorithmic tasks, where they often fail to find intuitive solutions that route relevant information to the right node/operation at the right time in the grid represented by Transformer columns. To facilitate the learning of useful control flow, we propose two modifications to the Transformer architecture, copy gate and geometric attention. Our novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task, as well as near-perfect accuracy on the simple arithmetic task and a new variant of ListOps testing for generalization across computational depths. NDR's attention and gating patterns tend to be interpretable as an intuitive form of neural routing. Our code is public.  ( 2 min )
    SWIS: Self-Supervised Representation Learning For Writer Independent Offline Signature Verification. (arXiv:2202.13078v1 [cs.CV])
    Writer independent offline signature verification is one of the most challenging tasks in pattern recognition as there is often a scarcity of training data. To handle such data scarcity problem, in this paper, we propose a novel self-supervised learning (SSL) framework for writer independent offline signature verification. To our knowledge, this is the first attempt to utilize self-supervised setting for the signature verification task. The objective of self-supervised representation learning from the signature images is achieved by minimizing the cross-covariance between two random variables belonging to different feature directions and ensuring a positive cross-covariance between the random variables denoting the same feature direction. This ensures that the features are decorrelated linearly and the redundant information is discarded. Through experimental results on different data sets, we obtained encouraging results.  ( 2 min )
    Provably Efficient Policy Optimization for Two-Player Zero-Sum Markov Games. (arXiv:2102.08903v2 [cs.LG] UPDATED)
    Policy-based methods with function approximation are widely used for solving two-player zero-sum games with large state and/or action spaces. However, it remains elusive how to obtain optimization and statistical guarantees for such algorithms. We present a new policy optimization algorithm with function approximation and prove that under standard regularity conditions on the Markov game and the function approximation class, our algorithm finds a near-optimal policy within a polynomial number of samples and iterations. To our knowledge, this is the first provably efficient policy optimization algorithm with function approximation that solves two-player zero-sum Markov games.  ( 2 min )
    Machine learning techniques to identify antibiotic resistance in patients diagnosed with various skin and soft tissue infections. (arXiv:2202.13496v1 [cs.LG])
    Skin and soft tissue infections (SSTIs) are among the most frequently observed diseases in ambulatory and hospital settings. Resistance of diverse bacterial pathogens to antibiotics is a significant cause of severe SSTIs, and treatment failure results in morbidity, mortality, and increased cost of hospitalization. Therefore, antimicrobial surveillance is essential to predict antibiotic resistance trends and monitor the results of medical interventions. To address this, we developed machine learning (ML) models (deep and conventional algorithms) to predict antimicrobial resistance using antibiotic susceptibility testing (ABST) data collected from patients clinically diagnosed with primary and secondary pyoderma over a period of one year. We trained an individual ML algorithm on each antimicrobial family to determine whether a Gram-Positive Cocci (GPC) or Gram-Negative Bacilli (GNB) bacteria will resist the corresponding antibiotic. For this purpose, clinical and demographic features from the patient and data from ABST were employed in training. We achieved an Area Under the Curve (AUC) of 0.68-0.98 in GPC and 0.56-0.93 in GNB bacteria, depending on the antimicrobial family. We also conducted a correlation analysis to determine the linear relationship between each feature and antimicrobial families in different bacteria. ML techniques suggest that a predictable nonlinear relationship exists between patients' clinical-demographic characteristics and antibiotic resistance; however, the accuracy of this prediction depends on the type of the antimicrobial family.  ( 2 min )
    ARCA23K: An audio dataset for investigating open-set label noise. (arXiv:2109.09227v2 [cs.SD] UPDATED)
    The availability of audio data on sound sharing platforms such as Freesound gives users access to large amounts of annotated audio. Utilising such data for training is becoming increasingly popular, but the problem of label noise that is often prevalent in such datasets requires further investigation. This paper introduces ARCA23K, an Automatically Retrieved and Curated Audio dataset comprised of over 23000 labelled Freesound clips. Unlike past datasets such as FSDKaggle2018 and FSDnoisy18K, ARCA23K facilitates the study of label noise in a more controlled manner. We describe the entire process of creating the dataset such that it is fully reproducible, meaning researchers can extend our work with little effort. We show that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and we refer to this type of label noise as open-set label noise. Experiments are carried out in which we study the impact of label noise in terms of classification performance and representation learning.  ( 2 min )
    Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect. (arXiv:2110.03677v2 [cs.LG] UPDATED)
    Recent empirical advances show that training deep models with large learning rate often improves generalization performance. However, theoretical justifications on the benefits of large learning rate are highly limited, due to challenges in analysis. In this paper, we consider using Gradient Descent (GD) with a large learning rate on a homogeneous matrix factorization problem, i.e., $\min_{X, Y} \|A - XY^\top\|_{\sf F}^2$. We prove a convergence theory for constant large learning rates well beyond $2/L$, where $L$ is the largest eigenvalue of Hessian at the initialization. Moreover, we rigorously establish an implicit bias of GD induced by such a large learning rate, termed 'balancing', meaning that magnitudes of $X$ and $Y$ at the limit of GD iterations will be close even if their initialization is significantly unbalanced. Numerical experiments are provided to support our theory.  ( 2 min )
    It Takes Two to Tango: Mixup for Deep Metric Learning. (arXiv:2106.04990v2 [cs.LG] UPDATED)
    Metric learning involves learning a discriminative representation such that embeddings of similar classes are encouraged to be close, while embeddings of dissimilar classes are pushed far apart. State-of-the-art methods focus mostly on sophisticated loss functions or mining strategies. On the one hand, metric learning losses consider two or more examples at a time. On the other hand, modern data augmentation methods for classification consider two or more examples at a time. The combination of the two ideas is under-studied. In this work, we aim to bridge this gap and improve representations using mixup, which is a powerful data augmentation approach interpolating two or more examples and corresponding target labels at a time. This task is challenging because unlike classification, the loss functions used in metric learning are not additive over examples, so the idea of interpolating target labels is not straightforward. To the best of our knowledge, we are the first to investigate mixing both examples and target labels for deep metric learning. We develop a generalized formulation that encompasses existing metric learning loss functions and modify it to accommodate for mixup, introducing Metric Mix, or Metrix. We also introduce a new metric - utilization, to demonstrate that by mixing examples during training, we are exploring areas of the embedding space beyond the training classes, thereby improving representations. To validate the effect of improved representations, we show that mixing inputs, intermediate representations or embeddings along with target labels significantly outperforms state-of-the-art metric learning methods on four benchmark deep metric learning datasets.  ( 2 min )
    Graph-Assisted Communication-Efficient Ensemble Federated Learning. (arXiv:2202.13447v1 [cs.LG])
    Communication efficiency arises as a necessity in federated learning due to limited communication bandwidth. To this end, the present paper develops an algorithmic framework where an ensemble of pre-trained models is learned. At each learning round, the server selects a subset of pre-trained models to construct the ensemble model based on the structure of a graph, which characterizes the server's confidence in the models. Then only the selected models are transmitted to the clients, such that certain budget constraints are not violated. Upon receiving updates from the clients, the server refines the structure of the graph accordingly. The proposed algorithm is proved to enjoy sub-linear regret bound. Experiments on real datasets demonstrate the effectiveness of our novel approach.  ( 2 min )
    Digital Signal Analysis based on Convolutional Neural Networks for Active Target Time Projection Chambers. (arXiv:2202.12941v1 [eess.SP])
    An algorithm for digital signal analysis using convolutional neural networks (CNN) was developed in this work. The main objective of this algorithm is to make the analysis of experiments with active target time projection chambers more efficient. The code is divided in three steps: baseline correction, signal deconvolution and peak detection and integration. The CNNs were able to learn the signal processing models with relative errors of less than 6\%. The analysis based on CNNs provides the same results as the traditional deconvolution algorithms, but considerably more efficient in terms of computing time (about 65 times faster). This opens up new possibilities to improve existing codes and to simplify the analysis of the large amount of data produced in active target experiments.  ( 2 min )
    Socialbots on Fire: Modeling Adversarial Behaviors of Socialbots via Multi-Agent Hierarchical Reinforcement Learning. (arXiv:2110.10655v2 [cs.SI] UPDATED)
    Socialbots are software-driven user accounts on social platforms, acting autonomously (mimicking human behavior), with the aims to influence the opinions of other users or spread targeted misinformation for particular goals. As socialbots undermine the ecosystem of social platforms, they are often considered harmful. As such, there have been several computational efforts to auto-detect the socialbots. However, to our best knowledge, the adversarial nature of these socialbots has not yet been studied. This begs a question "can adversaries, controlling socialbots, exploit AI techniques to their advantage?" To this question, we successfully demonstrate that indeed it is possible for adversaries to exploit computational learning mechanism such as reinforcement learning (RL) to maximize the influence of socialbots while avoiding being detected. We first formulate the adversarial socialbot learning as a cooperative game between two functional hierarchical RL agents. While one agent curates a sequence of activities that can avoid the detection, the other agent aims to maximize network influence by selectively connecting with right users. Our proposed policy networks train with a vast amount of synthetic graphs and generalize better than baselines on unseen real-life graphs both in terms of maximizing network influence (up to +18%) and sustainable stealthiness (up to +40% undetectability) under a strong bot detector (with 90% detection accuracy). During inference, the complexity of our approach scales linearly, independent of a network's structure and the virality of news. This makes our approach a practical adversarial attack when deployed in a real-life setting.  ( 2 min )
    Variational Interpretable Learning from Multi-view Data. (arXiv:2202.13503v1 [stat.ML])
    The main idea of canonical correlation analysis (CCA) is to map different views onto a common latent space with maximum correlation. We propose a deep interpretable variational canonical correlation analysis (DICCA) for multi-view learning. The developed model extends the existing latent variable model for linear CCA to nonlinear models through the use of deep generative networks. DICCA is designed to disentangle both the shared and view-specific variations for multi-view data. To further make the model more interpretable, we place a sparsity-inducing prior on the latent weight with a structured variational autoencoder that is comprised of view-specific generators. Empirical results on real-world datasets show that our methods are competitive across domains.  ( 2 min )
    Sample-Efficient Safety Assurances using Conformal Prediction. (arXiv:2109.14082v2 [cs.RO] UPDATED)
    When deploying machine learning models in high-stakes robotics applications, the ability to detect unsafe situations is crucial. Early warning systems can provide alerts when an unsafe situation is imminent (in the absence of corrective action). To reliably improve safety, these warning systems should have a provable false negative rate; i.e. of the situations that are unsafe, fewer than $\epsilon$ will occur without an alert. In this work, we present a framework that combines a statistical inference technique known as conformal prediction with a simulator of robot/environment dynamics, in order to tune warning systems to provably achieve an $\epsilon$ false negative rate using as few as $1/\epsilon$ data points. We apply our framework to a driver warning system and a robotic grasping application, and empirically demonstrate guaranteed false negative rate while also observing low false detection (positive) rate.  ( 2 min )
    Domain Knowledge-Based Automated Analog Circuit Design with Deep Reinforcement Learning. (arXiv:2202.13185v1 [cs.LG])
    The design automation of analog circuits is a longstanding challenge in the integrated circuit field. This paper presents a deep reinforcement learning method to expedite the design of analog circuits at the pre-layout stage, where the goal is to find device parameters to fulfill desired circuit specifications. Our approach is inspired by experienced human designers who rely on domain knowledge of analog circuit design (e.g., circuit topology and couplings between circuit specifications) to tackle the problem. Unlike all prior methods, our method originally incorporates such key domain knowledge into policy learning with a graph-based policy network, thereby best modeling the relations between circuit parameters and design targets. Experimental results on exemplary circuits show it achieves human-level design accuracy (~99%) with 1.5x efficiency of existing best-performing methods. Our method also shows better generalization ability to unseen specifications and optimality in circuit performance optimization. Moreover, it applies to designing diverse analog circuits across different semiconductor technologies, breaking the limitations of prior ad-hoc methods in designing one particular type of analog circuits with conventional semiconductor technology.  ( 2 min )
    Learning stochastic object models from medical imaging measurements by use of advanced ambient generative adversarial networks. (arXiv:2106.14324v2 [eess.IV] UPDATED)
    Purpose: To objectively assess new medical imaging technologies via computer-simulations, it is important to account for the variability in the ensemble of objects to be imaged. This source of variability can be described by stochastic object models (SOMs). It is generally desirable to establish SOMs from experimental imaging measurements acquired by use of a well-characterized imaging system, but this task has remained challenging. Approach: A generative adversarial network (GAN)-based method that employs AmbientGANs with modern progressive or multiresolution training approaches is proposed. AmbientGANs established using the proposed training procedure are systematically validated in a controlled way using computer-simulated magnetic resonance imaging (MRI) data corresponding to a stylized imaging system. Emulated single-coil experimental MRI data are also employed to demonstrate the methods under less stylized conditions. Results: The proposed AmbientGAN method can generate clean images when the imaging measurements are contaminated by measurement noise. When the imaging measurement data are incomplete, the proposed AmbientGAN can reliably learn the distribution of the measurement components of the objects. Conclusions: Both visual examinations and quantitative analyses, including task-specific validations using the Hotelling observer, demonstrated that the proposed AmbientGAN method holds promise to establish realistic SOMs from imaging measurements.  ( 2 min )
    Silhouettes and quasi residual plots for neural nets and tree-based classifiers. (arXiv:2106.08814v2 [stat.ML] UPDATED)
    Classification by neural nets and by tree-based methods are powerful tools of machine learning. There exist interesting visualizations of the inner workings of these and other classifiers. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or whether the classifier wants to assign it to different class. This is reflected in the (conditional and posterior) probability of the alternative class (PAC). A high PAC indicates label bias, i.e. the possibility that the case was mislabeled. The PAC is used to construct a silhouette plot which is similar in spirit to the silhouette plot for cluster analysis (Rousseeuw, 1987). The average silhouette width can be used to compare different classifications of the same dataset. We will also draw quasi residual plots of the PAC versus a data feature, which may lead to more insight in the data. One of these data features is how far each case lies from its given class. The graphical displays are illustrated and interpreted on benchmark data sets containing images, mixed features, and tweets.  ( 2 min )
    Meta-path Analysis on Spatio-Temporal Graphs for Pedestrian Trajectory Prediction. (arXiv:2202.13427v1 [cs.LG])
    Spatio-temporal graphs (ST-graphs) have been used to model time series tasks such as traffic forecasting, human motion modeling, and action recognition. The high-level structure and corresponding features from ST-graphs have led to improved performance over traditional architectures. However, current methods tend to be limited by simple features, despite the rich information provided by the full graph structure, which leads to inefficiencies and suboptimal performance in downstream tasks. We propose the use of features derived from meta-paths, walks across different types of edges, in ST-graphs to improve the performance of Structural Recurrent Neural Network. In this paper, we present the Meta-path Enhanced Structural Recurrent Neural Network (MESRNN), a generic framework that can be applied to any spatio-temporal task in a simple and scalable manner. We employ MESRNN for pedestrian trajectory prediction, utilizing these meta-path based features to capture the relationships between the trajectories of pedestrians at different points in time and space. We compare our MESRNN against state-of-the-art ST-graph methods on standard datasets to show the performance boost provided by meta-path information. The proposed model consistently outperforms the baselines in trajectory prediction over long time horizons by over 32\%, and produces more socially compliant trajectories in dense crowds. For more information please refer to the project website at https://sites.google.com/illinois.edu/mesrnn/home.  ( 2 min )
    Exploring with Sticky Mittens: Reinforcement Learning with Expert Interventions via Option Templates. (arXiv:2202.12967v1 [cs.LG])
    Environments with sparse rewards and long horizons pose a significant challenge for current reinforcement learning algorithms. A key feature enabling humans to learn challenging control tasks is that they often receive expert intervention that enables them to understand the high-level structure of the task before mastering low-level control actions. We propose a framework for leveraging expert intervention to solve long-horizon reinforcement learning tasks. We consider option templates, which are specifications encoding a potential option that can be trained using reinforcement learning. We formulate expert intervention as allowing the agent to execute option templates before learning an implementation. This enables them to use an option, before committing costly resources to learning it. We evaluate our approach on three challenging reinforcement learning problems, showing that it outperforms state of-the-art approaches by an order of magnitude. Project website at https://sites.google.com/view/stickymittens  ( 2 min )
    A Computer Vision-assisted Approach to Automated Real-Time Road Infrastructure Management. (arXiv:2202.13285v1 [cs.CV])
    Accurate automated detection of road pavement distresses is critical for the timely identification and repair of potentially accident-inducing road hazards such as potholes and other surface-level asphalt cracks. Deployment of such a system would be further advantageous in low-resource environments where lack of government funding for infrastructure maintenance typically entails heightened risks of potentially fatal vehicular road accidents as a result of inadequate and infrequent manual inspection of road systems for road hazards. To remedy this, a recent research initiative organized by the Institute of Electrical and Electronics Engineers ("IEEE") as part of their 2020 Global Road Damage Detection ("GRDC") Challenge published in May 2020 a novel 21,041 annotated image dataset of various road distresses calling upon academic and other researchers to submit innovative deep learning-based solutions to these road hazard detection problems. Making use of this dataset, we propose a supervised object detection approach leveraging You Only Look Once ("YOLO") and the Faster R-CNN frameworks to detect and classify road distresses in real-time via a vehicle dashboard-mounted smartphone camera, producing 0.68 F1-score experimental results ranking in the top 5 of 121 teams that entered this challenge as of December 2021.  ( 2 min )
    Safe Exploration for Efficient Policy Evaluation and Comparison. (arXiv:2202.13234v1 [cs.LG])
    High-quality data plays a central role in ensuring the accuracy of policy evaluation. This paper initiates the study of efficient and safe data collection for bandit policy evaluation. We formulate the problem and investigate its several representative variants. For each variant, we analyze its statistical properties, derive the corresponding exploration policy, and design an efficient algorithm for computing it. Both theoretical analysis and experiments support the usefulness of the proposed methods.  ( 2 min )
    The Spectral Bias of Polynomial Neural Networks. (arXiv:2202.13473v1 [cs.LG])
    Polynomial neural networks (PNNs) have been recently shown to be particularly effective at image generation and face recognition, where high-frequency information is critical. Previous studies have revealed that neural networks demonstrate a $\textit{spectral bias}$ towards low-frequency functions, which yields faster learning of low-frequency components during training. Inspired by such studies, we conduct a spectral analysis of the Neural Tangent Kernel (NTK) of PNNs. We find that the $\Pi$-Net family, i.e., a recently proposed parametrization of PNNs, speeds up the learning of the higher frequencies. We verify the theoretical bias through extensive experiments. We expect our analysis to provide novel insights into designing architectures and learning frameworks by incorporating multiplicative interactions via polynomials.  ( 2 min )
    Dropout can Simulate Exponential Number of Models for Sample Selection Techniques. (arXiv:2202.13203v1 [cs.LG])
    Following Coteaching, generally in the literature, two models are used in sample selection based approaches for training with noisy labels. Meanwhile, it is also well known that Dropout when present in a network trains an ensemble of sub-networks. We show how to leverage this property of Dropout to train an exponential number of shared models, by training a single model with Dropout. We show how we can modify existing two model-based sample selection methodologies to use an exponential number of shared models. Not only is it more convenient to use a single model with Dropout, but this approach also combines the natural benefits of Dropout with that of training an exponential number of models, leading to improved results.  ( 2 min )
    Refining Self-Supervised Learning in Imaging: Beyond Linear Metric. (arXiv:2202.12921v1 [cs.CV])
    We introduce in this paper a new statistical perspective, exploiting the Jaccard similarity metric, as a measure-based metric to effectively invoke non-linear features in the loss of self-supervised contrastive learning. Specifically, our proposed metric may be interpreted as a dependence measure between two adapted projections learned from the so-called latent representations. This is in contrast to the cosine similarity measure in the conventional contrastive learning model, which accounts for correlation information. To the best of our knowledge, this effectively non-linearly fused information embedded in the Jaccard similarity, is novel to self-supervision learning with promising results. The proposed approach is compared to two state-of-the-art self-supervised contrastive learning methods on three image datasets. We not only demonstrate its amenable applicability in current ML problems, but also its improved performance and training efficiency.  ( 2 min )
    Point Label Aware Superpixels for Multi-species Segmentation of Underwater Imagery. (arXiv:2202.13487v1 [cs.CV])
    Monitoring coral reefs using underwater vehicles increases the range of marine surveys and availability of historical ecological data by collecting significant quantities of images. Analysis of this imagery can be automated using a model trained to perform semantic segmentation, however it is too costly and time-consuming to densely label images for training supervised models. In this letter, we leverage photo-quadrat imagery labeled by ecologists with sparse point labels. We propose a point label aware method for propagating labels within superpixel regions to obtain augmented ground truth for training a semantic segmentation model. Our point label aware superpixel method utilizes the sparse point labels, and clusters pixels using learned features to accurately generate single-species segments in cluttered, complex coral images. Our method outperforms prior methods on the UCSD Mosaics dataset by 3.62% for pixel accuracy and 8.35% for mean IoU for the label propagation task. Furthermore, our approach reduces computation time reported by previous approaches by 76%. We train a DeepLabv3+ architecture and outperform state-of-the-art for semantic segmentation by 2.91% for pixel accuracy and 9.65% for mean IoU on the UCSD Mosaics dataset and by 4.19% for pixel accuracy and 14.32% mean IoU for the Eilat dataset.  ( 2 min )
    Semantic Supervision: Enabling Generalization over Output Spaces. (arXiv:2202.13100v1 [cs.LG])
    In this paper, we propose Semantic Supervision (SemSup) - a unified paradigm for training classifiers that generalize over output spaces. In contrast to standard classification, which treats classes as discrete symbols, SemSup represents them as dense vector features obtained from descriptions of classes (e.g., "The cat is a small carnivorous mammal"). This allows the output space to be unbounded (in the space of descriptions) and enables models to generalize both over unseen inputs and unseen outputs (e.g. "The aardvark is a nocturnal burrowing mammal with long ears"). Specifically, SemSup enables four types of generalization, to -- (1) unseen class descriptions, (2) unseen classes, (3) unseen super-classes, and (4) unseen tasks. Through experiments on four classification datasets across two variants (multi-class and multi-label), two input modalities (text and images), and two output description modalities (text and JSON), we show that our SemSup models significantly outperform standard supervised models and existing models that leverage word embeddings over class names. For instance, our model outperforms baselines by 40% and 20% precision points on unseen descriptions and classes, respectively, on a news categorization dataset (RCV1). SemSup can serve as a pathway for scaling neural models to large unbounded output spaces and enabling better generalization and model reuse for unseen tasks and domains.  ( 2 min )
    An Empirical Study on Text-Independent Speaker Verification based on the GE2E Method. (arXiv:2011.04896v4 [eess.AS] UPDATED)
    While many researchers in the speaker recognition area have started to replace the former classical state-of-the-art methods with deep learning techniques, some of the traditional i-vector-based methods are still state-of-the-art in the context of text-independent speaker verification. Google's Generalized End-to-End Loss for Speaker Verification (GE2E), a deep learning-based technique using long short-term memory units, has recently gained a lot of attention due to its speed in convergence and generalization. In this study, we aim at further studying the GE2E method and comparing different scenarios in order to investigate all of its aspects. Various experiments including the effects of a random sampling of test and enrollment utterances, test utterance duration, and the number of enrollment utterances are discussed in this article. Furthermore, we compare the GE2E method with the baseline state-of-the-art i-vector-based methods for text-independent speaker verification and show that it outperforms them by resulting in lower error rates while being end-to-end and requiring less training time for convergence.  ( 2 min )
    Semi-Supervised Learning and Data Augmentation in Wearable-based Momentary Stress Detection in the Wild. (arXiv:2202.12935v1 [eess.SP])
    Physiological and behavioral data collected from wearable or mobile sensors have been used to estimate self-reported stress levels. Since the stress annotation usually relies on self-reports during the study, a limited amount of labeled data can be an obstacle in developing accurate and generalized stress predicting models. On the other hand, the sensors can continuously capture signals without annotations. This work investigates leveraging unlabeled wearable sensor data for stress detection in the wild. We first applied data augmentation techniques on the physiological and behavioral data to improve the robustness of supervised stress detection models. Using an auto-encoder with actively selected unlabeled sequences, we pre-trained the supervised model structure to leverage the information learned from unlabeled samples. Then, we developed a semi-supervised learning framework to leverage the unlabeled data sequences. We combined data augmentation techniques with consistency regularization, which enforces the consistency of prediction output based on augmented and original unlabeled data. We validated these methods using three wearable/mobile sensor datasets collected in the wild. Our results showed that combining the proposed methods improved stress classification performance by 7.7% to 13.8% on the evaluated datasets, compared to the baseline supervised learning models.  ( 2 min )
    Automated Identification of Toxic Code Reviews: How Far Can We Go?. (arXiv:2202.13056v1 [cs.SE])
    Toxic conversations during software development interactions may have serious repercussions on a Free and Open Source Software (FOSS) development project. For example, victims of toxic conversations may become afraid to express themselves, therefore get demotivated, and may eventually leave the project. Automated filtering of toxic conversations may help a FOSS community to maintain healthy interactions among its members. However, off-the-shelf toxicity detectors perform poorly on Software Engineering (SE) dataset, such as one curated from code review comments. To encounter this challenge, we present ToxiCR, a supervised learning-based toxicity identification tool for code review interactions. ToxiCR includes a choice to select one of the ten supervised learning algorithms, an option to select text vectorization techniques, five mandatory and three optional SE domain specific processing steps, and a large scale labeled dataset of 19,571 code review comments. With our rigorous evaluation of the models with various combinations of preprocessing steps and vectorization techniques, we have identified the best combination for our dataset that boosts 95.8% accuracy and 88.9% F1 score. ToxiCR significantly outperforms existing toxicity detectors on our dataset. We have released our dataset, pretrained models, evaluation results, and source code publicly available at: https://github.com/WSU-SEAL/ToxiCR.  ( 2 min )
    FSGANv2: Improved Subject Agnostic Face Swapping and Reenactment. (arXiv:2202.12972v1 [cs.CV])
    We present Face Swapping GAN (FSGAN) for face swapping and reenactment. Unlike previous work, we offer a subject agnostic swapping scheme that can be applied to pairs of faces without requiring training on those faces. We derive a novel iterative deep learning--based approach for face reenactment which adjusts significant pose and expression variations that can be applied to a single image or a video sequence. For video sequences, we introduce a continuous interpolation of the face views based on reenactment, Delaunay Triangulation, and barycentric coordinates. Occluded face regions are handled by a face completion network. Finally, we use a face blending network for seamless blending of the two faces while preserving the target skin color and lighting conditions. This network uses a novel Poisson blending loss combining Poisson optimization with a perceptual loss. We compare our approach to existing state-of-the-art systems and show our results to be both qualitatively and quantitatively superior. This work describes extensions of the FSGAN method, proposed in an earlier conference version of our work, as well as additional experiments and results.  ( 2 min )
    Evaluating High-Order Predictive Distributions in Deep Learning. (arXiv:2202.13509v1 [stat.ML])
    Most work on supervised learning research has focused on marginal predictions. In decision problems, joint predictive distributions are essential for good performance. Previous work has developed methods for assessing low-order predictive distributions with inputs sampled i.i.d. from the testing distribution. With low-dimensional inputs, these methods distinguish agents that effectively estimate uncertainty from those that do not. We establish that the predictive distribution order required for such differentiation increases greatly with input dimension, rendering these methods impractical. To accommodate high-dimensional inputs, we introduce \textit{dyadic sampling}, which focuses on predictive distributions associated with random \textit{pairs} of inputs. We demonstrate that this approach efficiently distinguishes agents in high-dimensional examples involving simple logistic regression as well as complex synthetic and empirical data.  ( 2 min )
    Bayesian Active Learning for Discrete Latent Variable Models. (arXiv:2202.13426v1 [cs.LG])
    Active learning seeks to reduce the number of samples required to estimate the parameters of a model, thus forming an important class of techniques in modern machine learning. However, past work on active learning has largely overlooked latent variable models, which play a vital role in neuroscience, psychology, and a variety of other engineering and scientific disciplines. Here we address this gap in the literature and propose a novel framework for maximum-mutual-information input selection for learning discrete latent variable regression models. We first examine a class of models known as "mixtures of linear regressions" (MLR). This example is striking because it is well known that active learning confers no advantage for standard least-squares regression. However, we show -- both in simulations and analytically using Fisher information -- that optimal input selection can nevertheless provide dramatic gains for mixtures of regression models; we also validate this on a real-world application of MLRs. We then consider a powerful class of temporally structured latent variable models known as Input-Output Hidden Markov Models (IO-HMMs), which have recently gained prominence in neuroscience. We show that our method substantially speeds up learning, and outperforms a variety of approximate methods based on variational and amortized inference.  ( 2 min )
    Automated Parkinson's Disease Detection and Affective Analysis from Emotional EEG Signals. (arXiv:2202.12936v1 [eess.SP])
    While Parkinson's disease (PD) is typically characterized by motor disorder, there is evidence of diminished emotion perception in PD patients. This study examines the utility of affective Electroencephalography (EEG) signals to understand emotional differences between PD vs Healthy Controls (HC), and for automated PD detection. Employing traditional machine learning and deep learning methods, we explore (a) dimensional and categorical emotion recognition, and (b) PD vs HC classification from emotional EEG signals. Our results reveal that PD patients comprehend arousal better than valence, and amongst emotion categories, \textit{fear}, \textit{disgust} and \textit{surprise} less accurately, and \textit{sadness} most accurately. Mislabeling analyses confirm confounds among opposite-valence emotions with PD data. Emotional EEG responses also achieve near-perfect PD vs HC recognition. {Cumulatively, our study demonstrates that (a) examining \textit{implicit} responses alone enables (i) discovery of valence-related impairments in PD patients, and (ii) differentiation of PD from HC, and (b) emotional EEG analysis is an ecologically-valid, effective, facile and sustainable tool for PD diagnosis vis-\'a-vis self reports, expert assessments and resting-state analysis.}  ( 2 min )
    Off-Policy Evaluation with Policy-Dependent Optimization Response. (arXiv:2202.12958v1 [cs.LG])
    The intersection of causal inference and machine learning for decision-making is rapidly expanding, but the default decision criterion remains an \textit{average} of individual causal outcomes across a population. In practice, various operational restrictions ensure that a decision-maker's utility is not realized as an \textit{average} but rather as an \textit{output} of a downstream decision-making problem (such as matching, assignment, network flow, minimizing predictive risk). In this work, we develop a new framework for off-policy evaluation with a \textit{policy-dependent} linear optimization response: causal outcomes introduce stochasticity in objective function coefficients. In this framework, a decision-maker's utility depends on the policy-dependent optimization, which introduces a fundamental challenge of \textit{optimization} bias even for the case of policy evaluation. We construct unbiased estimators for the policy-dependent estimand by a perturbation method. We also discuss the asymptotic variance properties for a set of plug-in regression estimators adjusted to be compatible with that perturbation method. Lastly, attaining unbiased policy evaluation allows for policy optimization, and we provide a general algorithm for optimizing causal interventions. We corroborate our theoretical results with numerical simulations.  ( 2 min )
    Neuro-Inspired Deep Neural Networks with Sparse, Strong Activations. (arXiv:2202.13074v1 [cs.NE])
    While end-to-end training of Deep Neural Networks (DNNs) yields state of the art performance in an increasing array of applications, it does not provide insight into, or control over, the features being extracted. We report here on a promising neuro-inspired approach to DNNs with sparser and stronger activations. We use standard stochastic gradient training, supplementing the end-to-end discriminative cost function with layer-wise costs promoting Hebbian ("fire together," "wire together") updates for highly active neurons, and anti-Hebbian updates for the remaining neurons. Instead of batch norm, we use divisive normalization of activations (suppressing weak outputs using strong outputs), along with implicit $\ell_2$ normalization of neuronal weights. Experiments with standard image classification tasks on CIFAR-10 demonstrate that, relative to baseline end-to-end trained architectures, our proposed architecture (a) leads to sparser activations (with only a slight compromise on accuracy), (b) exhibits more robustness to noise (without being trained on noisy data), (c) exhibits more robustness to adversarial perturbations (without adversarial training).  ( 2 min )
    Adversarial Contrastive Self-Supervised Learning. (arXiv:2202.13072v1 [cs.CV])
    Recently, learning from vast unlabeled data, especially self-supervised learning, has been emerging and attracted widespread attention. Self-supervised learning followed by the supervised fine-tuning on a few labeled examples can significantly improve label efficiency and outperform standard supervised training using fully annotated data. In this work, we present a novel self-supervised deep learning paradigm based on online hard negative pair mining. Specifically, we design a student-teacher network to generate multi-view of the data for self-supervised learning and integrate hard negative pair mining into the training. Then we derive a new triplet-like loss considering both positive sample pairs and mined hard negative sample pairs. Extensive experiments demonstrate the effectiveness of the proposed method and its components on ILSVRC-2012.  ( 2 min )
    Limitations of Deep Learning for Inverse Problems on Digital Hardware. (arXiv:2202.13490v1 [cs.LG])
    Deep neural networks have seen tremendous success over the last years. Since the training is performed on digital hardware, in this paper, we analyze what actually can be computed on current hardware platforms modeled as Turing machines, which would lead to inherent restrictions of deep learning. For this, we focus on the class of inverse problems, which, in particular, encompasses any task to reconstruct data from measurements. We prove that finite-dimensional inverse problems are not Banach-Mazur computable for small relaxation parameters. In fact, our result even holds for Borel-Turing computability., i.e., there does not exist an algorithm which performs the training of a neural network on digital hardware for any given accuracy. This establishes a conceptual barrier on the capabilities of neural networks for finite-dimensional inverse problems given that the computations are performed on digital hardware.  ( 2 min )
    ONE-NAS: An Online NeuroEvolution based Neural Architecture Search for Time Series Forecasting. (arXiv:2202.13471v1 [cs.LG])
    Time series forecasting (TSF) is one of the most important tasks in data science, as accurate time series (TS) predictions can drive and advance a wide variety of domains including finance, transportation, health care, and power systems. However, real-world utilization of machine learning (ML) models for TSF suffers due to pretrained models being able to learn and adapt to unpredictable patterns as previously unseen data arrives over longer time scales. To address this, models must be periodically retained or redesigned, which takes significant human and computational resources. This work presents the Online NeuroEvolution based Neural Architecture Search (ONE-NAS) algorithm, which to the authors' knowledge is the first neural architecture search algorithm capable of automatically designing and training new recurrent neural networks (RNNs) in an online setting. Without any pretraining, ONE-NAS utilizes populations of RNNs which are continuously updated with new network structures and weights in response to new multivariate input data. ONE-NAS is tested on real-world large-scale multivariate wind turbine data as well a univariate Dow Jones Industrial Average (DJIA) dataset, and is shown to outperform traditional statistical time series forecasting, including naive, moving average, and exponential smoothing methods, as well as state of the art online ARIMA strategies.  ( 2 min )
    Automated Data Augmentations for Graph Classification. (arXiv:2202.13248v1 [cs.LG])
    Data augmentations are effective in improving the invariance of learning machines. We argue that the corechallenge of data augmentations lies in designing data transformations that preserve labels. This is relativelystraightforward for images, but much more challenging for graphs. In this work, we propose GraphAug, a novelautomated data augmentation method aiming at computing label-invariant augmentations for graph classification.Instead of using uniform transformations as in existing studies, GraphAug uses an automated augmentationmodel to avoid compromising critical label-related information of the graph, thereby producing label-invariantaugmentations at most times. To ensure label-invariance, we develop a training method based on reinforcementlearning to maximize an estimated label-invariance probability. Comprehensive experiments show that GraphAugoutperforms previous graph augmentation methods on various graph classification tasks.
    Vertical Machine Unlearning: Selectively Removing Sensitive Information From Latent Feature Space. (arXiv:2202.13295v1 [cs.LG])
    Recently, the enactment of privacy regulations has promoted the rise of machine unlearning paradigm. Most existing studies mainly focus on removing unwanted data samples from a learnt model. Yet we argue that they remove overmuch information of data samples from latent feature space, which is far beyond the sensitive feature scope that genuinely needs to be unlearned. In this paper, we investigate a vertical unlearning mode, aiming at removing only sensitive information from latent feature space. First, we introduce intuitive and formal definitions for this unlearning and show its orthogonal relationship with existing horizontal unlearning. Secondly, given the fact of lacking general solutions to vertical unlearning, we introduce a ground-breaking solution based on representation detachment, where the task-related information is encouraged to retain while the sensitive information is progressively forgotten. Thirdly, observing that some computation results during representation detachment are hard to obtain in practice, we propose an approximation with an upper bound to estimate it, with rigorous theoretical analysis. We validate our method by spanning several datasets and models with prevailing performance. We envision this work as a necessity for future machine unlearning system and an essential component of the latest privacy-related legislation.  ( 2 min )
    Integrated multimodal artificial intelligence framework for healthcare applications. (arXiv:2202.12998v1 [cs.LG])
    Artificial intelligence (AI) systems hold great promise to improve healthcare over the next decades. Specifically, AI systems leveraging multiple data sources and input modalities are poised to become a viable method to deliver more accurate results and deployable pipelines across a wide range of applications. In this work, we propose and evaluate a unified Holistic AI in Medicine (HAIM) framework to facilitate the generation and testing of AI systems that leverage multimodal inputs. Our approach uses generalizable data pre-processing and machine learning modeling stages that can be readily adapted for research and deployment in healthcare environments. We evaluate our HAIM framework by training and characterizing 14,324 independent models based on MIMIC-IV-MM, a multimodal clinical database (N=34,537 samples) containing 7,279 unique hospitalizations and 6,485 patients, spanning all possible input combinations of 4 data modalities (i.e., tabular, time-series, text and images), 11 unique data sources and 12 predictive tasks. We show that this framework can consistently and robustly produce models that outperform similar single-source approaches across various healthcare demonstrations (by 6-33%), including 10 distinct chest pathology diagnoses, along with length-of-stay and 48-hour mortality predictions. We also quantify the contribution of each modality and data source using Shapley values, which demonstrates the heterogeneity in data type importance and the necessity of multimodal inputs across different healthcare-relevant tasks. The generalizable properties and flexibility of our Holistic AI in Medicine (HAIM) framework could offer a promising pathway for future multimodal predictive systems in clinical and operational healthcare settings.  ( 2 min )
    A blob method method for inhomogeneous diffusion with applications to multi-agent control and sampling. (arXiv:2202.12927v1 [math.AP])
    As a counterpoint to classical stochastic particle methods for linear diffusion equations, we develop a deterministic particle method for the weighted porous medium equation (WPME) and prove its convergence on bounded time intervals. This generalizes related work on blob methods for unweighted porous medium equations. From a numerical analysis perspective, our method has several advantages: it is meshfree, preserves the gradient flow structure of the underlying PDE, converges in arbitrary dimension, and captures the correct asymptotic behavior in simulations. That our method succeeds in capturing the long time behavior of WPME is significant from the perspective of related problems in quantization. Just as the Fokker-Planck equation provides a way to quantize a probability measure $\bar{\rho}$ by evolving an empirical measure according to stochastic Langevin dynamics so that the empirical measure flows toward $\bar{\rho}$, our particle method provides a way to quantize $\bar{\rho}$ according to deterministic particle dynamics approximating WMPE. In this way, our method has natural applications to multi-agent coverage algorithms and sampling probability measures. A specific case of our method corresponds exactly to the mean-field dynamics of training a two-layer neural network for a radial basis function activation function. From this perspective, our convergence result shows that, in the over parametrized regime and as the variance of the radial basis functions goes to zero, the continuum limit is given by WPME. This generalizes previous results, which considered the case of a uniform data distribution, to the more general inhomogeneous setting. As a consequence of our convergence result, we identify conditions on the target function and data distribution for which convexity of the energy landscape emerges in the continuum limit.  ( 2 min )
    Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition. (arXiv:2202.12949v1 [eess.SP])
    As a fundamental problem in ubiquitous computing and machine learning, sensor-based human activity recognition (HAR) has drawn extensive attention and made great progress in recent years. HAR aims to recognize human activities based on the availability of rich time-series data collected from multi-modal sensors such as accelerometers and gyroscopes. However, recent deep learning methods are focusing on one view of the data, i.e., the temporal view, while shallow methods tend to utilize the hand-craft features for recognition, e.g., the statistics view. In this paper, to extract a better feature for advancing the performance, we propose a novel method, namely multi-view fusion transformer (MVFT) along with a novel attention mechanism. First, MVFT encodes three views of information, i.e., the temporal, frequent, and statistical views to generate multi-view features. Second, the novel attention mechanism uncovers inner- and cross-view clues to catalyze mutual interactions between three views for detailed relation modeling. Moreover, extensive experiments on two datasets illustrate the superiority of our methods over several state-of-the-art methods.  ( 2 min )
    Sign and Basis Invariant Networks for Spectral Graph Representation Learning. (arXiv:2202.13013v1 [cs.LG])
    Many machine learning tasks involve processing eigenvectors derived from data. Especially valuable are Laplacian eigenvectors, which capture useful structural information about graphs and other geometric objects. However, ambiguities arise when computing eigenvectors: for each eigenvector $v$, the sign flipped $-v$ is also an eigenvector. More generally, higher dimensional eigenspaces contain infinitely many choices of basis eigenvectors. These ambiguities make it a challenge to process eigenvectors and eigenspaces in a consistent way. In this work we introduce SignNet and BasisNet -- new neural architectures that are invariant to all requisite symmetries and hence process collections of eigenspaces in a principled manner. Our networks are universal, i.e., they can approximate any continuous function of eigenvectors with the proper invariances. They are also theoretically strong for graph representation learning -- they can approximate any spectral graph convolution, can compute spectral invariants that go beyond message passing neural networks, and can provably simulate previously proposed graph positional encodings. Experiments show the strength of our networks for learning spectral graph filters and learning graph positional encodings.  ( 2 min )
    Consensus Learning from Heterogeneous Objectives for One-Class Collaborative Filtering. (arXiv:2202.13140v1 [cs.LG])
    Over the past decades, for One-Class Collaborative Filtering (OCCF), many learning objectives have been researched based on a variety of underlying probabilistic models. From our analysis, we observe that models trained with different OCCF objectives capture distinct aspects of user-item relationships, which in turn produces complementary recommendations. This paper proposes a novel OCCF framework, named ConCF, that exploits the complementarity from heterogeneous objectives throughout the training process, generating a more generalizable model. ConCF constructs a multi-branch variant of a given target model by adding auxiliary heads, each of which is trained with heterogeneous objectives. Then, it generates consensus by consolidating the various views from the heads, and guides the heads based on the consensus. The heads are collaboratively evolved based on their complementarity throughout the training, which again results in generating more accurate consensus iteratively. After training, we convert the multi-branch architecture back to the original target model by removing the auxiliary heads, thus there is no extra inference cost for the deployment. Our extensive experiments on real-world datasets demonstrate that ConCF significantly improves the generalization of the model by exploiting the complementarity from heterogeneous objectives.  ( 2 min )
    Does Label Differential Privacy Prevent Label Inference Attacks?. (arXiv:2202.12968v1 [cs.LG])
    Label differential privacy (LDP) is a popular framework for training private ML models on datasets with public features and sensitive private labels. Despite its rigorous privacy guarantee, it has been observed that in practice LDP does not preclude label inference attacks (LIAs): Models trained with LDP can be evaluated on the public training features to recover, with high accuracy, the very private labels that it was designed to protect. In this work, we argue that this phenomenon is not paradoxical and that LDP merely limits the advantage of an LIA adversary compared to predicting training labels using the Bayes classifier. At LDP $\epsilon=0$ this advantage is zero, hence the optimal attack is to predict according to the Bayes classifier and is independent of the training labels. Finally, we empirically demonstrate that our result closely captures the behavior of simulated attacks on both synthetic and real world datasets.  ( 2 min )
    Towards Scalable and Robust Structured Bandits: A Meta-Learning Framework. (arXiv:2202.13227v1 [cs.LG])
    Online learning in large-scale structured bandits is known to be challenging due to the curse of dimensionality. In this paper, we propose a unified meta-learning framework for a general class of structured bandit problems where the parameter space can be factorized to item-level. The novel bandit algorithm is general to be applied to many popular problems,scalable to the huge parameter and action spaces, and robust to the specification of the generalization model. At the core of this framework is a Bayesian hierarchical model that allows information sharing among items via their features, upon which we design a meta Thompson sampling algorithm. Three representative examples are discussed thoroughly. Both theoretical analysis and numerical results support the usefulness of the proposed method.  ( 2 min )
    Towards Robust Off-policy Learning for Runtime Uncertainty. (arXiv:2202.13337v1 [cs.LG])
    Off-policy learning plays a pivotal role in optimizing and evaluating policies prior to the online deployment. However, during the real-time serving, we observe varieties of interventions and constraints that cause inconsistency between the online and offline settings, which we summarize and term as runtime uncertainty. Such uncertainty cannot be learned from the logged data due to its abnormality and rareness nature. To assert a certain level of robustness, we perturb the off-policy estimators along an adversarial direction in view of the runtime uncertainty. It allows the resulting estimators to be robust not only to observed but also unexpected runtime uncertainties. Leveraging this idea, we bring runtime-uncertainty robustness to three major off-policy learning methods: the inverse propensity score method, reward-model method, and doubly robust method. We theoretically justify the robustness of our methods to runtime uncertainty, and demonstrate their effectiveness using both the simulation and the real-world online experiments.  ( 2 min )
    OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs. (arXiv:2202.12929v1 [cs.CV])
    Text-to-image generation intends to automatically produce a photo-realistic image, conditioned on a textual description. It can be potentially employed in the field of art creation, data augmentation, photo-editing, etc. Although many efforts have been dedicated to this task, it remains particularly challenging to generate believable, natural scenes. To facilitate the real-world applications of text-to-image synthesis, we focus on studying the following three issues: 1) How to ensure that generated samples are believable, realistic or natural? 2) How to exploit the latent space of the generator to edit a synthesized image? 3) How to improve the explainability of a text-to-image generation framework? In this work, we constructed two novel data sets (i.e., the Good & Bad bird and face data sets) consisting of successful as well as unsuccessful generated samples, according to strict criteria. To effectively and efficiently acquire high-quality images by increasing the probability of generating Good latent codes, we use a dedicated Good/Bad classifier for generated images. It is based on a pre-trained front end and fine-tuned on the basis of the proposed Good & Bad data set. After that, we present a novel algorithm which identifies semantically-understandable directions in the latent space of a conditional text-to-image GAN architecture by performing independent component analysis on the pre-trained weight values of the generator. Furthermore, we develop a background-flattening loss (BFL), to improve the background appearance in the edited image. Subsequently, we introduce linear interpolation analysis between pairs of keywords. This is extended into a similar triangular `linguistic' interpolation in order to take a deep look into what a text-to-image synthesis model has learned within the linguistic embeddings. Our data set is available at https://zenodo.org/record/6283798#.YhkN_ujMI2w.  ( 2 min )
    Direct data-driven forecast of local turbulent heat flux in Rayleigh-B\'{e}nard convection. (arXiv:2202.13129v1 [physics.flu-dyn])
    A combined convolutional autoencoder-recurrent neural network machine learning model is presented to analyse and forecast the dynamics and low-order statistics of the local convective heat flux field in a two-dimensional turbulent Rayleigh-B\'{e}nard convection flow at Prandtl number ${\rm Pr}=7$ and Rayleigh number ${\rm Ra}=10^7$. Two recurrent neural networks are applied for the temporal advancement of flow data in the reduced latent data space, a reservoir computing model in the form of an echo state network and a recurrent gated unit. Thereby, the present work exploits the modular combination of three different machine learning algorithms to build a fully data-driven and reduced model for the dynamics of the turbulent heat transfer in a complex thermally driven flow. The convolutional autoencoder with 12 hidden layers is able to reduce the dimensionality of the turbulence data to about 0.2 \% of their original size. Our results indicate a fairly good accuracy in the first- and second-order statistics of the convective heat flux. The algorithm is also able to reproduce the intermittent plume-mixing dynamics at the upper edges of the thermal boundary layers with some deviations. The same holds for the probability density function of the local convective heat flux with differences in the far tails. Furthermore, we demonstrate the noise resilience of the framework which suggests the present model might be applicable as a reduced dynamical model that delivers transport fluxes and their variations to the coarse grid cells of larger-scale computational models, such as global circulation models for the atmosphere and ocean.  ( 2 min )
    Sampling in Dirichlet Process Mixture Models for Clustering Streaming Data. (arXiv:2202.13312v1 [cs.LG])
    Practical tools for clustering streaming data must be fast enough to handle the arrival rate of the observations. Typically, they also must adapt on the fly to possible lack of stationarity; i.e., the data statistics may be time-dependent due to various forms of drifts, changes in the number of clusters, etc. The Dirichlet Process Mixture Model (DPMM), whose Bayesian nonparametric nature allows it to adapt its complexity to the data, seems a natural choice for the streaming-data case. In its classical formulation, however, the DPMM cannot capture common types of drifts in the data statistics. Moreover, and regardless of that limitation, existing methods for online DPMM inference are too slow to handle rapid data streams. In this work we propose adapting both the DPMM and a known DPMM sampling-based non-streaming inference method for streaming-data clustering. We demonstrate the utility of the proposed method on several challenging settings, where it obtains state-of-the-art results while being on par with other methods in terms of speed.  ( 2 min )
    Deep Learning-Based Inverse Design for Engineering Systems: Multidisciplinary Design Optimization of Automotive Brakes. (arXiv:2202.13309v1 [cs.LG])
    The braking performance of the brake system is a target performance that must be considered for vehicle development. Apparent piston travel (APT) and drag torque are the most representative factors for evaluating braking performance. In particular, as the two performance factors have a conflicting relationship with each other, a multidisciplinary design optimization (MDO) approach is required for brake design. However, the computational cost of MDO increases as the number of disciplines increases. Recent studies on inverse design that use deep learning (DL) have established the possibility of instantly generating an optimal design that can satisfy the target performance without implementing an iterative optimization process. This study proposes a DL-based multidisciplinary inverse design (MID) that simultaneously satisfies multiple targets, such as the APT and drag torque of the brake system. Results show that the proposed inverse design can find the optimal design more efficiently compared with the conventional optimization methods, such as backpropagation and sequential quadratic programming. The MID achieved a similar performance to the single-disciplinary inverse design in terms of accuracy and computational cost. A novel design was derived on the basis of results, and the same performance was satisfied as that of the existing design.  ( 2 min )
    Quantum Algorithms for solving Hard Constrained Optimisation Problems. (arXiv:2202.13125v1 [quant-ph])
    The thesis deals with Quantum Algorithms for solving Hard Constrained Optimization Problems. It shows how quantum computers can solve difficult everyday problems such as finding the best schedule for social workers or the path of a robot picking and batching in a warehouse. The path to the solution has led to the definition of a new artificial intelligence paradigm with quantum computing, quantum Case-Based Reasoning (qCBR) and to a proof of concept to integrate the capacity of quantum computing within mobile robotics using a Raspberry Pi 4 as a processor (qRobot), capable of operating with leading technology players such as IBMQ, Amazon Braket (D-Wave) and Pennylane. To improve the execution time of variational algorithms in this NISQ era and the next, we have proposed EVA: a quantum Exponential Value Approximation algorithm that speeds up the VQE, and that is, to date, the flagship of the quantum computation. To improve the execution time of variational algorithms in this NISQ era and the next, we have proposed EVA: a quantum Exponential Value Approximation algorithm that speeds up the VQE, and that is, to date, the flagship of the quantum computation.  ( 2 min )
    Bayesian Robust Tensor Ring Model for Incomplete Multiway Data. (arXiv:2202.13321v1 [cs.LG])
    Low-rank tensor completion aims to recover missing entries from the observed data. However, the observed data may be disturbed by noise and outliers. Therefore, robust tensor completion (RTC) is proposed to solve this problem. The recently proposed tensor ring (TR) structure is applied to RTC due to its superior abilities in dealing with high-dimensional data with predesigned TR rank. To avoid manual rank selection and achieve a balance between low-rank component and sparse component, in this paper, we propose a Bayesian robust tensor ring (BRTR) decomposition method for RTC problem. Furthermore, we develop a variational Bayesian (VB) algorithm to infer the probability distribution of posteriors. During the learning process, the frontal slices of previous tensor and horizontal slices of latter tensor shared with the same TR rank with zero components are pruned, resulting in automatic rank determination. Compared with existing methods, BRTR can automatically learn TR rank without manual fine-tuning of parameters. Extensive experiments indicate that BRTR has better recovery performance and ability to remove noise than other state-of-the-art methods.  ( 2 min )
    Architectural Optimization and Feature Learning for High-Dimensional Time Series Datasets. (arXiv:2202.13486v1 [cs.LG])
    As our ability to sense increases, we are experiencing a transition from data-poor problems, in which the central issue is a lack of relevant data, to data-rich problems, in which the central issue is to identify a few relevant features in a sea of observations. Motivated by applications in gravitational-wave astrophysics, we study the problem of predicting the presence of transient noise artifacts in a gravitational wave detector from a rich collection of measurements from the detector and its environment. We argue that feature learning--in which relevant features are optimized from data--is critical to achieving high accuracy. We introduce models that reduce the error rate by over 60\% compared to the previous state of the art, which used fixed, hand-crafted features. Feature learning is useful not only because it improves performance on prediction tasks; the results provide valuable information about patterns associated with phenomena of interest that would otherwise be undiscoverable. In our application, features found to be associated with transient noise provide diagnostic information about its origin and suggest mitigation strategies. Learning in high-dimensional settings is challenging. Through experiments with a variety of architectures, we identify two key factors in successful models: sparsity, for selecting relevant variables within the high-dimensional observations; and depth, which confers flexibility for handling complex interactions and robustness with respect to temporal variations. We illustrate their significance through systematic experiments on real detector data. Our results provide experimental corroboration of common assumptions in the machine-learning community and have direct applicability to improving our ability to sense gravitational waves, as well as to many other problem settings with similarly high-dimensional, noisy, or partly irrelevant data.  ( 2 min )
    Conditional Simulation Using Diffusion Schr\"odinger Bridges. (arXiv:2202.13460v1 [stat.ML])
    Denoising diffusion models have recently emerged as a powerful class of generative models. They provide state-of-the-art results, not only for unconditional simulation, but also when used to solve conditional simulation problems arising in a wide range of inverse problems such as image inpainting or deblurring. A limitation of these models is that they are computationally intensive at generation time as they require simulating a diffusion process over a long time horizon. When performing unconditional simulation, a Schr\"odinger bridge formulation of generative modeling leads to a theoretically grounded algorithm shortening generation time which is complementary to other proposed acceleration techniques. We extend here the Schr\"odinger bridge framework to conditional simulation. We demonstrate this novel methodology on various applications including image super-resolution and optimal filtering for state-space models.  ( 2 min )
    Supervising Remote Sensing Change Detection Models with 3D Surface Semantics. (arXiv:2202.13251v1 [cs.CV])
    Remote sensing change detection, identifying changes between scenes of the same location, is an active area of research with a broad range of applications. Recent advances in multimodal self-supervised pretraining have resulted in state-of-the-art methods which surpass vision models trained solely on optical imagery. In the remote sensing field, there is a wealth of overlapping 2D and 3D modalities which can be exploited to supervise representation learning in vision models. In this paper we propose Contrastive Surface-Image Pretraining (CSIP) for joint learning using optical RGB and above ground level (AGL) map pairs. We then evaluate these pretrained models on several building segmentation and change detection datasets to show that our method does, in fact, extract features relevant to downstream applications where natural and artificial surface information is relevant.  ( 2 min )
    Federated Online Sparse Decision Making. (arXiv:2202.13448v1 [stat.ML])
    This paper presents a novel federated linear contextual bandits model, where individual clients face different K-armed stochastic bandits with high-dimensional decision context and coupled through common global parameters. By leveraging the sparsity structure of the linear reward , a collaborative algorithm named \texttt{Fedego Lasso} is proposed to cope with the heterogeneity across clients without exchanging local decision context vectors or raw reward data. \texttt{Fedego Lasso} relies on a novel multi-client teamwork-selfish bandit policy design, and achieves near-optimal regrets for shared parameter cases with logarithmic communication costs. In addition, a new conceptual tool called federated-egocentric policies is introduced to delineate exploration-exploitation trade-off. Experiments demonstrate the effectiveness of the proposed algorithms on both synthetic and real-world datasets.  ( 2 min )
    DAGAM: A Domain Adversarial Graph Attention Model for Subject Independent EEG-Based Emotion Recognition. (arXiv:2202.12948v1 [eess.SP])
    One of the most significant challenges of EEG-based emotion recognition is the cross-subject EEG variations, leading to poor performance and generalizability. This paper proposes a novel EEG-based emotion recognition model called the domain adversarial graph attention model (DAGAM). The basic idea is to generate a graph to model multichannel EEG signals using biological topology. Graph theory can topologically describe and analyze relationships and mutual dependency between channels of EEG. Then, unlike other graph convolutional networks, self-attention pooling is applied to benefit salient EEG feature extraction from the graph, which effectively improves the performance. Finally, after graph pooling, the domain adversarial based on the graph is employed to identify and handle EEG variation across subjects, efficiently reaching good generalizability. We conduct extensive evaluations on two benchmark datasets (SEED and SEED IV) and obtain state-of-the-art results in subject-independent emotion recognition. Our model boosts the SEED accuracy to 92.59% (4.69% improvement) with the lowest standard deviation of 3.21% (2.92% decrements) and SEED IV accuracy to 80.74% (6.90% improvement) with the lowest standard deviation of 4.14% (3.88% decrements) respectively.  ( 2 min )
    Variational Inference with Gaussian Mixture by Entropy Approximation. (arXiv:2202.13059v1 [stat.ML])
    Variational inference is a technique for approximating intractable posterior distributions in order to quantify the uncertainty of machine learning. Although the unimodal Gaussian distribution is usually chosen as a parametric distribution, it hardly approximates the multimodality. In this paper, we employ the Gaussian mixture distribution as a parametric distribution. A main difficulty of variational inference with the Gaussian mixture is how to approximate the entropy of the Gaussian mixture. We approximate the entropy of the Gaussian mixture as the sum of the entropy of the unimodal Gaussian, which can be analytically calculated. In addition, we theoretically analyze the approximation error between the true entropy and approximated one in order to reveal when our approximation works well. Specifically, the approximation error is controlled by the ratios of the distances between the means to the sum of the variances of the Gaussian mixture, and it converges to zero when the ratios go to infinity. This situation seems to be more likely to occur in higher dimensional weight spaces because of the curse of dimensionality. Therefore, our result guarantees that our approximation works well, for example, in neural networks that assume a large number of weights.  ( 2 min )
    Assessing the State of Self-Supervised Human Activity Recognition using Wearables. (arXiv:2202.12938v1 [eess.SP])
    The emergence of self-supervised learning in the field of wearables-based human activity recognition (HAR) has opened up opportunities to tackle the most pressing challenges in the field, namely to exploit unlabeled data to derive reliable recognition systems from only small amounts of labeled training samples. Furthermore, self-supervised methods enable a host of new application domains such as, for example, domain adaptation and transfer across sensor positions, activities etc. As such, self-supervision, i.e., the paradigm of 'pretrain-then-finetune' has the potential to become a strong alternative to the predominant end-to-end training approaches, let alone the classic activity recognition chain with hand-crafted features of sensor data. Recently a number of contributions have been made that introduced self-supervised learning into the field of HAR, including, Multi-task self-supervision, Masked Reconstruction, CPC to name but a few. With the initial success of these methods, the time has come for a systematic inventory and analysis of the potential self-supervised learning has for the field. This paper provides exactly that. We assess the progress of self-supervised HAR research by introducing a framework that performs a multi-faceted exploration of model performance. We organize the framework into three dimensions, each containing three constituent criteria, and utilize it to assess state-of-the-art self-supervised learning methods in a large empirical study on a curated set of nine diverse benchmarks. This exploration leads us to the formulation of insights into the properties of these techniques and to establish their value towards learning representations for diverse scenarios. Based on our findings we call upon the community to join our efforts and to contribute towards shaping the evaluation of the ongoing paradigm change in modeling human activities from body-worn sensor data.  ( 2 min )
    Incremental Inference on Higher-Order Probabilistic Graphical Models Applied to Constraint Satisfaction Problems. (arXiv:2202.12916v1 [cs.LG])
    Probabilistic graphical models (PGMs) are tools for solving complex probabilistic relationships. However, suboptimal PGM structures are primarily used in practice. This dissertation presents three contributions to the PGM literature. The first is a comparison between factor graphs and cluster graphs on graph colouring problems such as Sudokus - indicating a significant advantage for preferring cluster graphs. The second is an application of cluster graphs to a practical problem in cartography: land cover classification boosting. The third is a PGMs formulation for constraint satisfaction problems and an algorithm called purge-and-merge to solve such problems too complex for traditional PGMs.  ( 2 min )
    PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers. (arXiv:2202.13481v1 [cs.DC])
    In cloud machine learning (ML) inference systems, providing low latency to end-users is of utmost importance. However, maximizing server utilization and system throughput is also crucial for ML service providers as it helps lower the total-cost-of-ownership. GPUs have oftentimes been criticized for ML inference usages as its massive compute and memory throughput is hard to be fully utilized under low-batch inference scenarios. To address such limitation, NVIDIA's recently announced Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions". Such feature provides cloud ML service providers the ability to utilize the reconfigurable GPU not only for large-batch training but also for small-batch inference with the potential to achieve high resource utilization. In this paper, we study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server. Our first proposition is a sophisticated partitioning algorithm for reconfigurable GPUs that systematically determines a heterogeneous set of multi-granular GPU partitions, best suited for the inference server's deployment. Furthermore, we co-design an elastic scheduling algorithm tailored for our heterogeneously partitioned GPU server which effectively balances low latency and high GPU utilization.  ( 2 min )
    Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons. (arXiv:2202.13163v1 [stat.ML])
    We consider reinforcement learning (RL) methods in offline domains without additional online data collection, such as mobile health applications. Most of existing policy optimization algorithms in the computer science literature are developed in online settings where data are easy to collect or simulate. Their generalizations to mobile health applications with a pre-collected offline dataset remain unknown. The aim of this paper is to develop a novel advantage learning framework in order to efficiently use pre-collected data for policy optimization. The proposed method takes an optimal Q-estimator computed by any existing state-of-the-art RL algorithms as input, and outputs a new policy whose value is guaranteed to converge at a faster rate than the policy derived based on the initial Q-estimator. Extensive numerical experiments are conducted to back up our theoretical findings.  ( 2 min )
    An Evaluation of the EEG alpha-to-theta and theta-to-alpha band Ratios as Indexes of Mental Workload. (arXiv:2202.12937v1 [eess.SP])
    Many research works indicate that EEG bands, specifically the alpha and theta bands, have been potentially helpful cognitive load indicators. However, minimal research exists to validate this claim. This study aims to assess and analyze the impact of the alpha-to-theta and the theta-to-alpha band ratios on supporting the creation of models capable of discriminating self-reported perceptions of mental workload. A dataset of raw EEG data was utilized in which 48 subjects performed a resting activity and an induced task demanding exercise in the form of a multitasking SIMKAP test. Band ratios were devised from frontal and parietal electrode clusters. Building and model testing was done with high-level independent features from the frequency and temporal domains extracted from the computed ratios over time. Target features for model training were extracted from the subjective ratings collected after resting and task demand activities. Models were built by employing Logistic Regression, Support Vector Machines and Decision Trees and were evaluated with performance measures including accuracy, recall, precision and f1-score. The results indicate high classification accuracy of those models trained with the high-level features extracted from the alpha-to-theta ratios and theta-to-alpha ratios. Preliminary results also show that models trained with logistic regression and support vector machines can accurately classify self-reported perceptions of mental workload. This research contributes to the body of knowledge by demonstrating the richness of the information in the temporal, spectral and statistical domains extracted from the alpha-to-theta and theta-to-alpha EEG band ratios for the discrimination of self-reported perceptions of mental workload.  ( 2 min )
    Pix2NeRF: Unsupervised Conditional $\pi$-GAN for Single Image to Neural Radiance Fields Translation. (arXiv:2202.13162v1 [cs.CV])
    We propose a pipeline to generate Neural Radiance Fields~(NeRF) of an object or a scene of a specific class, conditioned on a single input image. This is a challenging task, as training NeRF requires multiple views of the same scene, coupled with corresponding poses, which are hard to obtain. Our method is based on $\pi$-GAN, a generative model for unconditional 3D-aware image synthesis, which maps random latent codes to radiance fields of a class of objects. We jointly optimize (1) the $\pi$-GAN objective to utilize its high-fidelity 3D-aware generation and (2) a carefully designed reconstruction objective. The latter includes an encoder coupled with $\pi$-GAN generator to form an auto-encoder. Unlike previous few-shot NeRF approaches, our pipeline is unsupervised, capable of being trained with independent images without 3D, multi-view, or pose supervision. Applications of our pipeline include 3d avatar generation, object-centric novel view synthesis with a single input image, and 3d-aware super-resolution, to name a few.  ( 2 min )
    Near Optimal Reconstruction of Spherical Harmonic Expansions. (arXiv:2202.12995v1 [math.NA])
    We propose an algorithm for robust recovery of the spherical harmonic expansion of functions defined on the d-dimensional unit sphere $\mathbb{S}^{d-1}$ using a near-optimal number of function evaluations. We show that for any $f \in L^2(\mathbb{S}^{d-1})$, the number of evaluations of $f$ needed to recover its degree-$q$ spherical harmonic expansion equals the dimension of the space of spherical harmonics of degree at most $q$ up to a logarithmic factor. Moreover, we develop a simple yet efficient algorithm to recover degree-$q$ expansion of $f$ by only evaluating the function on uniformly sampled points on $\mathbb{S}^{d-1}$. Our algorithm is based on the connections between spherical harmonics and Gegenbauer polynomials and leverage score sampling methods. Unlike the prior results on fast spherical harmonic transform, our proposed algorithm works efficiently using a nearly optimal number of samples in any dimension d. We further illustrate the empirical performance of our algorithm on numerical examples.  ( 2 min )
    Causal Domain Adaptation with Copula Entropy based Conditional Independence Test. (arXiv:2202.13482v1 [cs.LG])
    Domain Adaptation (DA) is a typical problem in machine learning that aims to transfer the model trained on source domain to target domain with different distribution. Causal DA is a special case of DA that solves the problem from the view of causality. It embeds the probabilistic relationships in multiple domains in a larger causal structure network of a system and tries to find the causal source (or intervention) on the system as the reason of distribution drifts of the system states across domains. In this sense, causal DA is transformed as a causal discovery problem that finds invariant representation across domains through the conditional independence between the state variables and observable state of the system given interventions. Testing conditional independence is the corner stone of causal discovery. Recently, a copula entropy based conditional independence test was proposed with a rigorous theory and a non-parametric estimation method. In this paper, we first present a mathemetical model for causal DA problem and then propose a method for causal DA that finds the invariant representation across domains with the copula entropy based conditional independence test. The effectiveness of the method is verified on two simulated data. The power of the proposed method is then demonstrated on two real-world data: adult census income data and gait characteristics data.  ( 2 min )
    Adversarial robustness of sparse local Lipschitz predictors. (arXiv:2202.13216v1 [cs.LG])
    This work studies the adversarial robustness of parametric functions composed of a linear predictor and a non-linear representation map. Our analysis relies on sparse local Lipschitzness (SLL), an extension of local Lipschitz continuity that better captures the stability and reduced effective dimensionality of predictors upon local perturbations. SLL functions preserve a certain degree of structure, given by the sparsity pattern in the representation map, and include several popular hypothesis classes, such as piece-wise linear models, Lasso and its variants, and deep feed-forward ReLU networks. We provide a tighter robustness certificate on the minimal energy of an adversarial example, as well as tighter data-dependent non-uniform bounds on the robust generalization error of these predictors. We instantiate these results for the case of deep neural networks and provide numerical evidence that supports our results, shedding new insights into natural regularization strategies to increase the robustness of these models.  ( 2 min )
  • Open

    Flexible variable selection in the presence of missing data. (arXiv:2202.12989v1 [stat.ME])
    In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model (e.g., a generalized or penalized linear model). In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose several nonparametric variable selection algorithms combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithms that achieve control of commonly used error rates. Through simulations, we show that our proposals have good operating characteristics and result in panels with higher classification performance compared to several existing penalized regression approaches. Finally, we use the proposed methods to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.  ( 2 min )
    Simplified and Unified Analysis of Various Learning Problems by Reduction to Multiple-Instance Learning. (arXiv:1911.05999v3 [cs.LG] UPDATED)
    In statistical learning, many problem formulations have been proposed so far, such as multi-class learning, complementarily labeled learning, multi-label learning, multi-task learning, which provide theoretical models for various real-world tasks. Although they have been extensively studied, the relationship among them has not been fully investigated. In this work, we focus on a particular problem formulation called Multiple-Instance Learning (MIL), and show that various learning problems including all the problems mentioned above with some of new problems can be reduced to MIL with theoretically guaranteed generalization bounds, where the reductions are established under a new reduction scheme we provide as a by-product. The results imply that the MIL-reduction gives a simplified and unified framework for designing and analyzing algorithms for various learning problems. Moreover, we show that the MIL-reduction framework can be kernelized.  ( 2 min )
    Combating the Instability of Mutual Information-based Losses via Regularization. (arXiv:2011.07932v2 [cs.LG] UPDATED)
    Notable progress has been made in numerous fields of machine learning based on neural network-driven mutual information (MI) bounds. However, utilizing the conventional MI-based losses is often challenging due to their practical and mathematical limitations. In this work, we first identify the symptoms behind their instability: (1) the neural network not converging even after the loss seemed to converge, and (2) saturating neural network outputs causing the loss to diverge. We mitigate both issues by adding a novel regularization term to the existing losses. We theoretically and experimentally demonstrate that added regularization stabilizes training. Finally, we present a novel benchmark that evaluates MI-based losses on both the MI estimation power and its capability on the downstream tasks, closely following the pre-existing supervised and contrastive learning settings. We evaluate six different MI-based losses and their regularized counterparts on multiple benchmarks to show that our approach is simple yet effective.  ( 2 min )
    Bayesian Structure Learning with Generative Flow Networks. (arXiv:2202.13903v1 [cs.LG])
    In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of Bayesian networks, from data. Defining such a distribution is very challenging, due to the combinatorially large sample space, and approximations based on MCMC are often required. Recently, a novel class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling of discrete and composite objects, such as graphs. In this work, we propose to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem, where the graph is constructed one edge at a time, based on learned transition probabilities. Through evaluation on both simulated and real data, we show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs, and it compares favorably against other methods based on MCMC or variational inference.  ( 2 min )
    Learning polytopes with fixed facet directions. (arXiv:2201.03419v2 [math.MG] UPDATED)
    We consider the task of reconstructing polytopes with fixed facet directions from finitely many support function evaluations. We show that for fixed simplicial normal fan the least-squares estimate is given by a convex quadratic program. We study the geometry of the solution set and give a combinatorial characterization for the uniqueness of the reconstruction in this case. We provide an algorithm that, under mild assumptions, converges to the unknown input shape as the number of noisy support function evaluations increases. We also discuss limitations of our results if the restriction on the normal fan is removed.  ( 2 min )
    Convergence of a robust deep FBSDE method for stochastic control. (arXiv:2201.06854v3 [math.OC] UPDATED)
    In this paper, we propose a deep learning based numerical scheme for strongly coupled FBSDEs, stemming from stochastic control. It is a modification of the deep BSDE method in which the initial value to the backward equation is not a free parameter, and with a new loss function being the weighted sum of the cost of the control problem, and a variance term which coincides with the mean squared error in the terminal condition. We show by a numerical example that a direct extension of the classical deep BSDE method to FBSDEs, fails for a simple linear-quadratic control problem, and motivate why the new method works. Under regularity and boundedness assumptions on the exact controls of time continuous and time discrete control problems, we provide an error analysis for our method. We show empirically that the method converges for three different problems, one being the one that failed for a direct extension of the deep BSDE method.  ( 2 min )
    How and what to learn:The modes of machine learning. (arXiv:2202.13829v1 [cs.LG])
    We proposal a new approach, namely the weight pathway analysis (WPA), to study the mechanism of multilayer neural networks. The weight pathways linking neurons longitudinally from input neurons to output neurons are considered as the basic units of a neural network. We decompose a neural network into a series of subnetworks of weight pathways, and establish characteristic maps for these subnetworks. The parameters of a characteristic map can be visualized, providing a longitudinal perspective of the network and making the neural network explainable. Using WPA, we discover that a neural network stores and utilizes information in a "holographic" way, that is, the network encodes all training samples in a coherent structure. An input vector interacts with this "holographic" structure to enhance or suppress each subnetwork which working together to produce the correct activities in the output neurons to recognize the input sample. Furthermore, with WPA, we reveal fundamental learning modes of a neural network: the linear learning mode and the nonlinear learning mode. The former extracts linearly separable features while the latter extracts linearly inseparable features. It is found that hidden-layer neurons self-organize into different classes in the later stages of the learning process. It is further discovered that the key strategy to improve the performance of a neural network is to control the ratio of the two learning modes to match that of the linear and the nonlinear features, and that increasing the width or the depth of a neural network helps this ratio controlling process. This provides theoretical ground for the practice of optimizing a neural network via increasing its width or its depth. The knowledge gained with WPA enables us to understand the fundamental questions such as what to learn, how to learn, and how can learn well.  ( 2 min )
    Semi-supervised Learning on Large Graphs: is Poisson Learning a Game-Changer?. (arXiv:2202.13608v1 [stat.ML])
    We explain Poisson learning on graph-based semi-supervised learning to see if it could avoid the problem of global information loss problem as Laplace-based learning methods on large graphs. From our analysis, Poisson learning is simply Laplace regularization with thresholding, cannot overcome the problem.  ( 2 min )
    Towards Return Parity in Markov Decision Processes. (arXiv:2111.10476v2 [cs.LG] UPDATED)
    Algorithmic decisions made by machine learning models in high-stakes domains may have lasting impacts over time. However, naive applications of standard fairness criterion in static settings over temporal domains may lead to delayed and adverse effects. To understand the dynamics of performance disparity, we study a fairness problem in Markov decision processes (MDPs). Specifically, we propose return parity, a fairness notion that requires MDPs from different demographic groups that share the same state and action spaces to achieve approximately the same expected time-discounted rewards. We first provide a decomposition theorem for return disparity, which decomposes the return disparity of any two MDPs sharing the same state and action spaces into the distance between group-wise reward functions, the discrepancy of group policies, and the discrepancy between state visitation distributions induced by the group policies. Motivated by our decomposition theorem, we propose algorithms to mitigate return disparity via learning a shared group policy with state visitation distributional alignment using integral probability metrics. We conduct experiments to corroborate our results, showing that the proposed algorithm can successfully close the disparity gap while maintaining the performance of policies on two real-world recommender system benchmark datasets.  ( 2 min )
    Variational Interpretable Learning from Multi-view Data. (arXiv:2202.13503v1 [stat.ML])
    The main idea of canonical correlation analysis (CCA) is to map different views onto a common latent space with maximum correlation. We propose a deep interpretable variational canonical correlation analysis (DICCA) for multi-view learning. The developed model extends the existing latent variable model for linear CCA to nonlinear models through the use of deep generative networks. DICCA is designed to disentangle both the shared and view-specific variations for multi-view data. To further make the model more interpretable, we place a sparsity-inducing prior on the latent weight with a structured variational autoencoder that is comprised of view-specific generators. Empirical results on real-world datasets show that our methods are competitive across domains.  ( 2 min )
    Learning to Estimate Without Bias. (arXiv:2110.12403v2 [cs.LG] UPDATED)
    The Gauss Markov theorem states that the weighted least squares estimator is a linear minimum variance unbiased estimation (MVUE) in linear models. In this paper, we take a first step towards extending this result to non-linear settings via deep learning with bias constraints. The classical approach to designing non-linear MVUEs is through maximum likelihood estimation (MLE) which often involves real-time computationally challenging optimizations. On the other hand, deep learning methods allow for non-linear estimators with fixed computational complexity. Learning based estimators perform optimally on average with respect to their training set but may suffer from significant bias in other parameters. To avoid this, we propose to add a simple bias constraint to the loss function, resulting in an estimator we refer to as Bias Constrained Estimator (BCE). We prove that this yields asymptotic MVUEs that behave similarly to the classical MLEs and asymptotically attain the Cramer Rao bound. We demonstrate the advantages of our approach in the context of signal to noise ratio estimation as well as covariance estimation. A second motivation to BCE is in applications where multiple estimates of the same unknown are averaged for improved performance. Examples include distributed sensor networks and data augmentation in test-time. In such applications, we show that BCE leads to asymptotically consistent estimators.  ( 2 min )
    Non-Transferable Learning: A New Approach for Model Ownership Verification and Applicability Authorization. (arXiv:2106.06916v2 [cs.LG] UPDATED)
    As Artificial Intelligence as a Service gains popularity, protecting well-trained models as intellectual property is becoming increasingly important. There are two common types of protection methods: ownership verification and usage authorization. In this paper, we propose Non-Transferable Learning (NTL), a novel approach that captures the exclusive data representation in the learned model and restricts the model generalization ability to certain domains. This approach provides effective solutions to both model verification and authorization. Specifically: 1) For ownership verification, watermarking techniques are commonly used but are often vulnerable to sophisticated watermark removal methods. By comparison, our NTL-based ownership verification provides robust resistance to state-of-the-art watermark removal methods, as shown in extensive experiments with 6 removal approaches over the digits, CIFAR10 & STL10, and VisDA datasets. 2) For usage authorization, prior solutions focus on authorizing specific users to access the model, but authorized users can still apply the model to any data without restriction. Our NTL-based authorization approach instead provides data-centric protection, which we call applicability authorization, by significantly degrading the performance of the model on unauthorized data. Its effectiveness is also shown through experiments on the aforementioned datasets.  ( 2 min )
    Robust Training under Label Noise by Over-parameterization. (arXiv:2202.14026v1 [cs.LG])
    Recently, over-parameterized deep networks, with increasingly more network parameters than training samples, have dominated the performances of modern machine learning. However, when the training data is corrupted, it has been well-known that over-parameterized networks tend to overfit and do not generalize. In this work, we propose a principled approach for robust training of over-parameterized deep networks in classification tasks where a proportion of training labels are corrupted. The main idea is yet very simple: label noise is sparse and incoherent with the network learned from clean data, so we model the noise and learn to separate it from the data. Specifically, we model the label noise via another sparse over-parameterization term, and exploit implicit algorithmic regularizations to recover and separate the underlying corruptions. Remarkably, when trained using such a simple method in practice, we demonstrate state-of-the-art test accuracy against label noise on a variety of real datasets. Furthermore, our experimental results are corroborated by theory on simplified linear models, showing that exact separation between sparse noise and low-rank data can be achieved under incoherent conditions. The work opens many interesting directions for improving over-parameterized models by using sparse over-parameterization and implicit regularization.  ( 2 min )
    Robust and Accurate Authorship Attribution via Program Normalization. (arXiv:2007.00772v3 [cs.LG] UPDATED)
    Source code attribution approaches have achieved remarkable accuracy thanks to the rapid advances in deep learning. However, recent studies shed light on their vulnerability to adversarial attacks. In particular, they can be easily deceived by adversaries who attempt to either create a forgery of another author or to mask the original author. To address these emerging issues, we formulate this security challenge into a general threat model, the $\textit{relational adversary}$, that allows an arbitrary number of the semantics-preserving transformations to be applied to an input in any problem space. Our theoretical investigation shows the conditions for robustness and the trade-off between robustness and accuracy in depth. Motivated by these insights, we present a novel learning framework, $\textit{normalize-and-predict}$ ($\textit{N&P}$), that in theory guarantees the robustness of any authorship-attribution approach. We conduct an extensive evaluation of $\textit{N&P}$ in defending two of the latest authorship-attribution approaches against state-of-the-art attack methods. Our evaluation demonstrates that $\textit{N&P}$ improves the accuracy on adversarial inputs by as much as 70% over the vanilla models. More importantly, $\textit{N&P}$ also increases robust accuracy to 45% higher than adversarial training while running over 40 times faster.  ( 2 min )
    Sequential Deconfounding for Causal Inference with Unobserved Confounders. (arXiv:2104.09323v3 [stat.ME] UPDATED)
    Using observational data to estimate the effect of a treatment is a powerful tool for decision-making when randomized experiments are infeasible or costly. However, observational data often yields biased estimates of treatment effects, since treatment assignment can be confounded by unobserved variables. A remedy is offered by deconfounding methods that adjust for such unobserved confounders. In this paper, we develop the Sequential Deconfounder, a method that enables estimating individualized treatment effects over time in presence of unobserved confounders. This is the first deconfounding method that can be used in a general sequential setting (i.e., with one or more treatments assigned at each timestep). The Sequential Deconfounder uses a novel Gaussian process latent variable model to infer substitutes for the unobserved confounders, which are then used in conjunction with an outcome model to estimate treatment effects over time. We prove that using our method yields unbiased estimates of individualized treatment responses over time. Using simulated and real medical data, we demonstrate the efficacy of our method in deconfounding the estimation of treatment responses over time.  ( 2 min )
    Functional mixture-of-experts for classification. (arXiv:2202.13934v1 [stat.ML])
    We develop a mixtures-of-experts (ME) approach to the multiclass classification where the predictors are univariate functions. It consists of a ME model in which both the gating network and the experts network are constructed upon multinomial logistic activation functions with functional inputs. We perform a regularized maximum likelihood estimation in which the coefficient functions enjoy interpretable sparsity constraints on targeted derivatives. We develop an EM-Lasso like algorithm to compute the regularized MLE and evaluate the proposed approach on simulated and real data.  ( 2 min )
    Theoretical bounds on data requirements for the ray-based classification. (arXiv:2103.09577v3 [cs.LG] UPDATED)
    The problem of classifying high-dimensional shapes in real-world data grows in complexity as the dimension of the space increases. For the case of identifying convex shapes of different geometries, a new classification framework has recently been proposed in which the intersections of a set of one-dimensional representations, called rays, with the boundaries of the shape are used to identify the specific geometry. This ray-based classification (RBC) has been empirically verified using a synthetic dataset of two- and three-dimensional shapes (Zwolak et al. in Proceedings of Third Workshop on Machine Learning and the Physical Sciences (NeurIPS 2020), Vancouver, Canada [December 11, 2020], arXiv:2010.00500, 2020) and, more recently, has also been validated experimentally (Zwolak et al., PRX Quantum 2:020335, 2021). Here, we establish a bound on the number of rays necessary for shape classification, defined by key angular metrics, for arbitrary convex shapes. For two dimensions, we derive a lower bound on the number of rays in terms of the shape's length, diameter, and exterior angles. For convex polytopes in $\mathbb{R}^N$, we generalize this result to a similar bound given as a function of the dihedral angle and the geometrical parameters of polygonal faces. This result enables a different approach for estimating high-dimensional shapes using substantially fewer data elements than volumetric or surface-based approaches.  ( 2 min )
    Bayesian Active Learning for Discrete Latent Variable Models. (arXiv:2202.13426v1 [cs.LG])
    Active learning seeks to reduce the number of samples required to estimate the parameters of a model, thus forming an important class of techniques in modern machine learning. However, past work on active learning has largely overlooked latent variable models, which play a vital role in neuroscience, psychology, and a variety of other engineering and scientific disciplines. Here we address this gap in the literature and propose a novel framework for maximum-mutual-information input selection for learning discrete latent variable regression models. We first examine a class of models known as "mixtures of linear regressions" (MLR). This example is striking because it is well known that active learning confers no advantage for standard least-squares regression. However, we show -- both in simulations and analytically using Fisher information -- that optimal input selection can nevertheless provide dramatic gains for mixtures of regression models; we also validate this on a real-world application of MLRs. We then consider a powerful class of temporally structured latent variable models known as Input-Output Hidden Markov Models (IO-HMMs), which have recently gained prominence in neuroscience. We show that our method substantially speeds up learning, and outperforms a variety of approximate methods based on variational and amortized inference.  ( 2 min )
    Surrogate Model For Field Optimization Using Beta-VAE Based Regression. (arXiv:2008.11433v2 [cs.LG] UPDATED)
    Oilfield development related decisions are made using reservoir simulation-based optimization study in which different production scenarios and well controls are compared. Such simulations are computationally expensive and so surrogate models are used to accelerate studies. Deep learning has been used in past to generate surrogates, but such models often fail to quantify prediction uncertainty and are not interpretable. In this work, beta-VAE based regression is proposed to generate simulation surrogates for use in optimization workflow. beta-VAE enables interpretable, factorized representation of decision variables in latent space, which is then further used for regression. Probabilistic dense layers are used to quantify prediction uncertainty and enable approximate Bayesian inference. Surrogate model developed using beta-VAE based regression finds interpretable and relevant latent representation. A reasonable value of beta ensures a good balance between factor disentanglement and reconstruction. Probabilistic dense layer helps in quantifying predicted uncertainty for objective function, which is then used to decide whether full-physics simulation is required for a case.  ( 2 min )
    PUDLE: Implicit Acceleration of Dictionary Learning by Backpropagation. (arXiv:2106.00058v3 [cs.LG] UPDATED)
    The dictionary learning problem, representing data as a combination of a few atoms, has long stood as a popular method for learning representations in statistics and signal processing. The most popular dictionary learning algorithm alternates between sparse coding and dictionary update steps, and rich literature has studied its theoretical convergence. The growing popularity of neurally plausible unfolded sparse coding networks has led to the empirical finding that backpropagation through such networks performs dictionary learning. This paper offers the theoretical proof for these empirical results through PUDLE, a Provable Unfolded Dictionary LEarning method. We provide sufficient conditions on the network initialization and data distribution for model recovery. We highlight the impact of loss, unfolding, and backpropagation on convergence. We discuss proper initializations and strategies to resolve the instability issue concerning backpropagated gradient. This is achieved through tuning network parameters and modification of loss function. We complement our findings through synthetic and image denoising experiments. Finally, we highlight the interpretability of PULDE by deriving a mathematical relation between network weights, its output, and the training data.  ( 2 min )
    Learning Continuous Exponential Families Beyond Gaussian. (arXiv:2102.09198v2 [cs.LG] UPDATED)
    We address the problem of learning of continuous exponential family distributions with unbounded support. While a lot of progress has been made on learning of Gaussian graphical models, we still lack scalable algorithms for reconstructing general continuous exponential families modeling higher-order moments of the data beyond the mean and the covariance. Here, we introduce a computationally efficient method for learning continuous graphical models based on the Interaction Screening approach. Through a series of numerical experiments, we show that our estimator maintains similar requirements in terms of accuracy and sample complexity scalings compared to alternative approaches such as maximization of conditional likelihood, while considerably improving upon the algorithm's run-time.  ( 2 min )
    Generalised Gaussian Process Latent Variable Models (GPLVM) with Stochastic Variational Inference. (arXiv:2202.12979v1 [cs.LG])
    Gaussian process latent variable models (GPLVM) are a flexible and non-linear approach to dimensionality reduction, extending classical Gaussian processes to an unsupervised learning context. The Bayesian incarnation of the GPLVM Titsias and Lawrence, 2010] uses a variational framework, where the posterior over latent variables is approximated by a well-behaved variational family, a factorized Gaussian yielding a tractable lower bound. However, the non-factories ability of the lower bound prevents truly scalable inference. In this work, we study the doubly stochastic formulation of the Bayesian GPLVM model amenable with minibatch training. We show how this framework is compatible with different latent variable formulations and perform experiments to compare a suite of models. Further, we demonstrate how we can train in the presence of massively missing data and obtain high-fidelity reconstructions. We demonstrate the model's performance by benchmarking against the canonical sparse GPLVM for high-dimensional data examples.  ( 2 min )
    On the Power and Limitations of Random Features for Understanding Neural Networks. (arXiv:1904.00687v4 [cs.LG] UPDATED)
    Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as if those components are essentially fixed at their initial random values. In fact, fixing these explicitly leads to the well-known approach of learning with random features. In other words, these techniques imply that we can successfully learn with neural networks, whenever we can successfully learn with random features. In this paper, we first review these techniques, providing a simple and self-contained analysis for one-hidden-layer networks. We then argue that despite the impressive positive results, random feature approaches are also inherently limited in what they can explain. In particular, we rigorously show that random features cannot be used to learn even a single ReLU neuron with standard Gaussian inputs, unless the network size (or magnitude of the weights) is exponentially large. Since a single neuron is learnable with gradient-based methods, we conclude that we are still far from a satisfying general explanation for the empirical success of neural networks.  ( 2 min )
    Learning a Single Neuron with Gradient Methods. (arXiv:2001.05205v3 [cs.LG] UPDATED)
    We consider the fundamental problem of learning a single neuron $x \mapsto\sigma(w^\top x)$ using standard gradient methods. As opposed to previous works, which considered specific (and not always realistic) input distributions and activation functions $\sigma(\cdot)$, we ask whether a more general result is attainable, under milder assumptions. On the one hand, we show that some assumptions on the distribution and the activation function are necessary. On the other hand, we prove positive guarantees under mild assumptions, which go beyond those studied in the literature so far. We also point out and study the challenges in further strengthening and generalizing our results.  ( 2 min )
    Learning stochastic object models from medical imaging measurements by use of advanced ambient generative adversarial networks. (arXiv:2106.14324v2 [eess.IV] UPDATED)
    Purpose: To objectively assess new medical imaging technologies via computer-simulations, it is important to account for the variability in the ensemble of objects to be imaged. This source of variability can be described by stochastic object models (SOMs). It is generally desirable to establish SOMs from experimental imaging measurements acquired by use of a well-characterized imaging system, but this task has remained challenging. Approach: A generative adversarial network (GAN)-based method that employs AmbientGANs with modern progressive or multiresolution training approaches is proposed. AmbientGANs established using the proposed training procedure are systematically validated in a controlled way using computer-simulated magnetic resonance imaging (MRI) data corresponding to a stylized imaging system. Emulated single-coil experimental MRI data are also employed to demonstrate the methods under less stylized conditions. Results: The proposed AmbientGAN method can generate clean images when the imaging measurements are contaminated by measurement noise. When the imaging measurement data are incomplete, the proposed AmbientGAN can reliably learn the distribution of the measurement components of the objects. Conclusions: Both visual examinations and quantitative analyses, including task-specific validations using the Hotelling observer, demonstrated that the proposed AmbientGAN method holds promise to establish realistic SOMs from imaging measurements.  ( 2 min )
    Conditional Simulation Using Diffusion Schr\"odinger Bridges. (arXiv:2202.13460v1 [stat.ML])
    Denoising diffusion models have recently emerged as a powerful class of generative models. They provide state-of-the-art results, not only for unconditional simulation, but also when used to solve conditional simulation problems arising in a wide range of inverse problems such as image inpainting or deblurring. A limitation of these models is that they are computationally intensive at generation time as they require simulating a diffusion process over a long time horizon. When performing unconditional simulation, a Schr\"odinger bridge formulation of generative modeling leads to a theoretically grounded algorithm shortening generation time which is complementary to other proposed acceleration techniques. We extend here the Schr\"odinger bridge framework to conditional simulation. We demonstrate this novel methodology on various applications including image super-resolution and optimal filtering for state-space models.  ( 2 min )
    Variational Inference with Gaussian Mixture by Entropy Approximation. (arXiv:2202.13059v1 [stat.ML])
    Variational inference is a technique for approximating intractable posterior distributions in order to quantify the uncertainty of machine learning. Although the unimodal Gaussian distribution is usually chosen as a parametric distribution, it hardly approximates the multimodality. In this paper, we employ the Gaussian mixture distribution as a parametric distribution. A main difficulty of variational inference with the Gaussian mixture is how to approximate the entropy of the Gaussian mixture. We approximate the entropy of the Gaussian mixture as the sum of the entropy of the unimodal Gaussian, which can be analytically calculated. In addition, we theoretically analyze the approximation error between the true entropy and approximated one in order to reveal when our approximation works well. Specifically, the approximation error is controlled by the ratios of the distances between the means to the sum of the variances of the Gaussian mixture, and it converges to zero when the ratios go to infinity. This situation seems to be more likely to occur in higher dimensional weight spaces because of the curse of dimensionality. Therefore, our result guarantees that our approximation works well, for example, in neural networks that assume a large number of weights.  ( 2 min )
    Resolving label uncertainty with implicit posterior models. (arXiv:2202.14000v1 [cs.LG])
    We propose a method for jointly inferring labels across a collection of data samples, where each sample consists of an observation and a prior belief about the label. By implicitly assuming the existence of a generative model for which a differentiable predictor is the posterior, we derive a training objective that allows learning under weak beliefs. This formulation unifies various machine learning settings; the weak beliefs can come in the form of noisy or incomplete labels, likelihoods given by a different prediction mechanism on auxiliary input, or common-sense priors reflecting knowledge about the structure of the problem at hand. We demonstrate the proposed algorithms on diverse problems: classification with negative training examples, learning from rankings, weakly and self-supervised aerial imagery segmentation, co-segmentation of video frames, and coarsely supervised text classification.  ( 2 min )
    Ray-based classification framework for high-dimensional data. (arXiv:2010.00500v2 [cs.LG] UPDATED)
    While classification of arbitrary structures in high dimensions may require complete quantitative information, for simple geometrical structures, low-dimensional qualitative information about the boundaries defining the structures can suffice. Rather than using dense, multi-dimensional data, we propose a deep neural network (DNN) classification framework that utilizes a minimal collection of one-dimensional representations, called \emph{rays}, to construct the "fingerprint" of the structure(s) based on substantially reduced information. We empirically study this framework using a synthetic dataset of double and triple quantum dot devices and apply it to the classification problem of identifying the device state. We show that the performance of the ray-based classifier is already on par with traditional 2D images for low dimensional systems, while significantly cutting down the data acquisition cost.  ( 2 min )
    Provably Efficient Policy Optimization for Two-Player Zero-Sum Markov Games. (arXiv:2102.08903v2 [cs.LG] UPDATED)
    Policy-based methods with function approximation are widely used for solving two-player zero-sum games with large state and/or action spaces. However, it remains elusive how to obtain optimization and statistical guarantees for such algorithms. We present a new policy optimization algorithm with function approximation and prove that under standard regularity conditions on the Markov game and the function approximation class, our algorithm finds a near-optimal policy within a polynomial number of samples and iterations. To our knowledge, this is the first provably efficient policy optimization algorithm with function approximation that solves two-player zero-sum Markov games.  ( 2 min )
    Bayesian Robust Tensor Ring Model for Incomplete Multiway Data. (arXiv:2202.13321v1 [cs.LG])
    Low-rank tensor completion aims to recover missing entries from the observed data. However, the observed data may be disturbed by noise and outliers. Therefore, robust tensor completion (RTC) is proposed to solve this problem. The recently proposed tensor ring (TR) structure is applied to RTC due to its superior abilities in dealing with high-dimensional data with predesigned TR rank. To avoid manual rank selection and achieve a balance between low-rank component and sparse component, in this paper, we propose a Bayesian robust tensor ring (BRTR) decomposition method for RTC problem. Furthermore, we develop a variational Bayesian (VB) algorithm to infer the probability distribution of posteriors. During the learning process, the frontal slices of previous tensor and horizontal slices of latter tensor shared with the same TR rank with zero components are pruned, resulting in automatic rank determination. Compared with existing methods, BRTR can automatically learn TR rank without manual fine-tuning of parameters. Extensive experiments indicate that BRTR has better recovery performance and ability to remove noise than other state-of-the-art methods.  ( 2 min )
    A Proximal Algorithm for Sampling. (arXiv:2202.13975v1 [cs.LG])
    We consider sampling problems with possibly non-smooth potentials (negative log-densities). In particular, we study two specific settings of sampling where the convex potential is either semi-smooth or in composite form as the sum of a smooth component and a semi-smooth component. To overcome the challenges caused by the non-smoothness, we propose a Markov chain Monte Carlo algorithm that resembles proximal methods in optimization for these sampling tasks. The key component of our method is a sampling scheme for a quadratically regularized target potential. This scheme relies on rejection sampling with a carefully designed Gaussian proposal whose center is an approximate minimizer of the regularized potential. We develop a novel technique (a modified Gaussian integral) to bound the complexity of this rejection sampling scheme in spite of the non-smoothness in the potentials. We then combine this scheme with the alternating sampling framework (ASF), which uses Gibbs sampling on an augmented distribution, to accomplish the two settings of sampling tasks we consider. Furthermore, by combining the complexity bound of the rejection sampling we develop and the remarkable convergence properties of ASF discovered recently, we are able to establish several non-asymptotic complexity bounds for our algorithm, in terms of the total number of queries of subgradient of the target potential. Our algorithm achieves state-of-the-art complexity bounds compared with all existing methods in the same settings.  ( 2 min )
    High Dimensional Statistical Estimation under One-bit Quantization. (arXiv:2202.13157v1 [stat.ML])
    Compared with data with high precision, one-bit (binary) data are preferable in many applications because of the efficiency in signal storage, processing, transmission, and enhancement of privacy. In this paper, we study three fundamental statistical estimation problems, i.e., sparse covariance matrix estimation, sparse linear regression, and low-rank matrix completion via binary data arising from an easy-to-implement one-bit quantization process that contains truncation, dithering and quantization as typical steps. Under both sub-Gaussian and heavy-tailed regimes, new estimators that handle high-dimensional scaling are proposed. In sub-Gaussian case, we show that our estimators achieve minimax rates up to logarithmic factors, hence the quantization nearly costs nothing from the perspective of statistical learning rate. In heavy-tailed case, we truncate the data before dithering to achieve a bias-variance trade-off, which results in estimators embracing convergence rates that are the square root of the corresponding minimax rates. Experimental results on synthetic data are reported to support and demonstrate the statistical properties of our estimators under one-bit quantization.  ( 2 min )
    Missing Value Knockoffs. (arXiv:2202.13054v1 [stat.ML])
    One limitation of the most statistical/machine learning-based variable selection approaches is their inability to control the false selections. A recently introduced framework, model-x knockoffs, provides that to a wide range of models but lacks support for datasets with missing values. In this work, we discuss ways of preserving the theoretical guarantees of the model-x framework in the missing data setting. First, we prove that posterior sampled imputation allows reusing existing knockoff samplers in the presence of missing values. Second, we show that sampling knockoffs only for the observed variables and applying univariate imputation also preserves the false selection guarantees. Third, for the special case of latent variable models, we demonstrate how jointly imputing and sampling knockoffs can reduce the computational complexity. We have verified the theoretical findings with two different exploratory variable distributions and investigated how the missing data pattern, amount of correlation, the number of observations, and missing values affected the statistical power.  ( 2 min )
    Sign and Basis Invariant Networks for Spectral Graph Representation Learning. (arXiv:2202.13013v1 [cs.LG])
    Many machine learning tasks involve processing eigenvectors derived from data. Especially valuable are Laplacian eigenvectors, which capture useful structural information about graphs and other geometric objects. However, ambiguities arise when computing eigenvectors: for each eigenvector $v$, the sign flipped $-v$ is also an eigenvector. More generally, higher dimensional eigenspaces contain infinitely many choices of basis eigenvectors. These ambiguities make it a challenge to process eigenvectors and eigenspaces in a consistent way. In this work we introduce SignNet and BasisNet -- new neural architectures that are invariant to all requisite symmetries and hence process collections of eigenspaces in a principled manner. Our networks are universal, i.e., they can approximate any continuous function of eigenvectors with the proper invariances. They are also theoretically strong for graph representation learning -- they can approximate any spectral graph convolution, can compute spectral invariants that go beyond message passing neural networks, and can provably simulate previously proposed graph positional encodings. Experiments show the strength of our networks for learning spectral graph filters and learning graph positional encodings.  ( 2 min )
    Capturing Actionable Dynamics with Structured Latent Ordinary Differential Equations. (arXiv:2202.12932v1 [stat.ML])
    End-to-end learning of dynamical systems with black-box models, such as neural ordinary differential equations (ODEs), provides a flexible framework for learning dynamics from data without prescribing a mathematical model for the dynamics. Unfortunately, this flexibility comes at the cost of understanding the dynamical system, for which ODEs are used ubiquitously. Further, experimental data are collected under various conditions (inputs), such as treatments, or grouped in some way, such as part of sub-populations. Understanding the effects of these system inputs on system outputs is crucial to have any meaningful model of a dynamical system. To that end, we propose a structured latent ODE model that explicitly captures system input variations within its latent representation. Building on a static latent variable specification, our model learns (independent) stochastic factors of variation for each input to the system, thus separating the effects of the system inputs in the latent space. This approach provides actionable modeling through the controlled generation of time-series data for novel input combinations (or perturbations). Additionally, we propose a flexible approach for quantifying uncertainties, leveraging a quantile regression formulation. Experimental results on challenging biological datasets show consistent improvements over competitive baselines in the controlled generation of observational data and prediction of biologically meaningful system inputs.  ( 2 min )
    Bandit Learning with General Function Classes: Heteroscedastic Noise and Variance-dependent Regret Bounds. (arXiv:2202.13603v1 [cs.LG])
    We consider learning a stochastic bandit model, where the reward function belongs to a general class of uniformly bounded functions, and the additive noise can be heteroscedastic. Our model captures contextual linear bandits and generalized linear bandits as special cases. While previous works (Kirschner and Krause, 2018; Zhou et al., 2021) based on weighted ridge regression can deal with linear bandits with heteroscedastic noise, they are not directly applicable to our general model due to the curse of nonlinearity. In order to tackle this problem, we propose a multi-level learning framework for the general bandit model. The core idea of our framework is to partition the observed data into different levels according to the variance of their respective reward and perform online learning at each level collaboratively. Under our framework, we first design an algorithm that constructs the variance-aware confidence set based on empirical risk minimization and prove a variance-dependent regret bound. For generalized linear bandits, we further propose an algorithm based on follow-the-regularized-leader (FTRL) subroutine and online-to-confidence-set conversion, which can achieve a tighter variance-dependent regret under certain conditions.
    Regularized Bilinear Discriminant Analysis for Multivariate Time Series Data. (arXiv:2202.13188v1 [cs.LG])
    In recent years, the methods on matrix-based or bilinear discriminant analysis (BLDA) have received much attention. Despite their advantages, it has been reported that the traditional vector-based regularized LDA (RLDA) is still quite competitive and could outperform BLDA on some benchmark datasets. Nevertheless, it is also noted that this finding is mainly limited to image data. In this paper, we propose regularized BLDA (RBLDA) and further explore the comparison between RLDA and RBLDA on another type of matrix data, namely multivariate time series (MTS). Unlike image data, MTS typically consists of multiple variables measured at different time points. Although many methods for MTS data classification exist within the literature, there is relatively little work in exploring the matrix data structure of MTS data. Moreover, the existing BLDA can not be performed when one of its within-class matrices is singular. To address the two problems, we propose RBLDA for MTS data classification, where each of the two within-class matrices is regularized via one parameter. We develop an efficient implementation of RBLDA and an efficient model selection algorithm with which the cross validation procedure for RBLDA can be performed efficiently. Experiments on a number of real MTS data sets are conducted to evaluate the proposed algorithm and compare RBLDA with several closely related methods, including RLDA and BLDA. The results reveal that RBLDA achieves the best overall recognition performance and the proposed model selection algorithm is efficient; Moreover, RBLDA can produce better visualization of MTS data than RLDA.
    Joint Learning of Linear Time-Invariant Dynamical Systems. (arXiv:2112.10955v3 [stat.ML] UPDATED)
    Learning the parameters of multiple related linear time-invariant (LTI) systems is a problem of interest that remains unexplored to date. We develop a joint estimator for learning the transition matrices of multiple LTI systems that share common basis matrices. Further, we establish finite-time error bounds that reflect the influence of trajectory lengths, dimension, number of systems, and the transition matrices. The results hold under fairly general assumptions and showcase the gains from pooling data across systems, in comparison to learning individual systems separately. We also show robustness to misspecifications in the joint structure of the LTI systems. To derive the results, novel methods are developed for establishing spectral properties of transition matrices and of dependent random matrices, as well as effects of unknown deviation matrices on joint learning.  ( 2 min )
    Selection, Ignorability and Challenges With Causal Fairness. (arXiv:2202.13774v1 [stat.ML])
    In this paper we look at popular fairness methods that use causal counterfactuals. These methods capture the intuitive notion that a prediction is fair if it coincides with the prediction that would have been made if someone's race, gender or religion were counterfactually different. In order to achieve this, we must have causal models that are able to capture what someone would be like if we were to counterfactually change these traits. However, we argue that any model that can do this must lie outside the particularly well behaved class that is commonly considered in the fairness literature. This is because in fairness settings, models in this class entail a particularly strong causal assumption, normally only seen in a randomised controlled trial. We argue that in general this is unlikely to hold. Furthermore, we show in many cases it can be explicitly rejected due to the fact that samples are selected from a wider population. We show this creates difficulties for counterfactual fairness as well as for the application of more general causal fairness methods.
    Robust Classification via Support Vector Machines. (arXiv:2104.13458v2 [stat.ML] UPDATED)
    Classification models are very sensitive to data uncertainty, and finding robust classifiers that are less sensitive to data uncertainty has raised great interest in the machine learning literature. This paper aims to construct robust \emph{Support Vector Machine} classifiers under feature data uncertainty via two probabilistic arguments. The first classifier, \emph{Single Perturbation}, reduces the local effect of data uncertainty with respect to one given feature and acts as a local test that could confirm or refute the presence of significant data uncertainty for that particular feature. The second classifier, \emph{Extreme Empirical Loss}, aims to reduce the aggregate effect of data uncertainty with respect to all features, which is possible via a trade-off between the number of prediction model violations and the size of these violations. Both methodologies are computationally efficient and our extensive numerical investigation highlights the advantages and possible limitations of the two robust classifiers on synthetic and real-life data.  ( 2 min )
    Federated Online Sparse Decision Making. (arXiv:2202.13448v1 [stat.ML])
    This paper presents a novel federated linear contextual bandits model, where individual clients face different K-armed stochastic bandits with high-dimensional decision context and coupled through common global parameters. By leveraging the sparsity structure of the linear reward , a collaborative algorithm named \texttt{Fedego Lasso} is proposed to cope with the heterogeneity across clients without exchanging local decision context vectors or raw reward data. \texttt{Fedego Lasso} relies on a novel multi-client teamwork-selfish bandit policy design, and achieves near-optimal regrets for shared parameter cases with logarithmic communication costs. In addition, a new conceptual tool called federated-egocentric policies is introduced to delineate exploration-exploitation trade-off. Experiments demonstrate the effectiveness of the proposed algorithms on both synthetic and real-world datasets.  ( 2 min )
    Exploring with Sticky Mittens: Reinforcement Learning with Expert Interventions via Option Templates. (arXiv:2202.12967v1 [cs.LG])
    Environments with sparse rewards and long horizons pose a significant challenge for current reinforcement learning algorithms. A key feature enabling humans to learn challenging control tasks is that they often receive expert intervention that enables them to understand the high-level structure of the task before mastering low-level control actions. We propose a framework for leveraging expert intervention to solve long-horizon reinforcement learning tasks. We consider option templates, which are specifications encoding a potential option that can be trained using reinforcement learning. We formulate expert intervention as allowing the agent to execute option templates before learning an implementation. This enables them to use an option, before committing costly resources to learning it. We evaluate our approach on three challenging reinforcement learning problems, showing that it outperforms state of-the-art approaches by an order of magnitude. Project website at https://sites.google.com/view/stickymittens  ( 2 min )
    Risk-Neutral Market Simulation. (arXiv:2202.13996v1 [q-fin.CP])
    We develop a risk-neutral spot and equity option market simulator for a single underlying, under which the joint market process is a martingale. We leverage an efficient low-dimensional representation of the market which preserves no static arbitrage, and employ neural spline flows to simulate samples which are free from conditional drifts and are highly realistic in the sense that among all possible risk-neutral simulators, the obtained risk-neutral simulator is the closest to the historical data with respect to the Kullback-Leibler divergence. Numerical experiments demonstrate the effectiveness and highlight both drift removal and fidelity of the calibrated simulator.  ( 2 min )
    Strong Consistency for a Class of Adaptive Clustering Procedures. (arXiv:2202.13423v1 [math.ST])
    We introduce a class of clustering procedures which includes $k$-means and $k$-medians, as well as variants of these where the domain of the cluster centers can be chosen adaptively (for example, $k$-medoids) and where the number of cluster centers can be chosen adaptively (for example, according to the elbow method). In the non-parametric setting and assuming only the finiteness of certain moments, we show that all clustering procedures in this class are strongly consistent under IID samples. Our method of proof is to directly study the continuity of various deterministic maps associated with these clustering procedures, and to show that strong consistency simply descends from analogous strong consistency of the empirical measures. In the adaptive setting, our work provides a strong consistency result that is the first of its kind. In the non-adaptive setting, our work strengthens Pollard's classical result by dispensing with various unnecessary technical hypotheses, by upgrading the particular notion of strong consistency, and by using the same methods to prove further limit theorems.  ( 2 min )
    Neural Contextual Bandits without Regret. (arXiv:2107.03144v2 [stat.ML] UPDATED)
    Contextual bandits are a rich model for sequential decision making given side information, with important applications, e.g., in recommender systems. We propose novel algorithms for contextual bandits harnessing neural networks to approximate the unknown reward function. We resolve the open problem of proving sublinear regret bounds in this setting for general context sequences, considering both fully-connected and convolutional networks. To this end, we first analyze NTK-UCB, a kernelized bandit optimization algorithm employing the Neural Tangent Kernel (NTK), and bound its regret in terms of the NTK maximum information gain $\gamma_T$, a complexity parameter capturing the difficulty of learning. Our bounds on $\gamma_T$ for the NTK may be of independent interest. We then introduce our neural network based algorithm NN-UCB, and show that its regret closely tracks that of NTK-UCB. Under broad non-parametric assumptions about the reward function, our approach converges to the optimal policy at a $\tilde{\mathcal{O}}(T^{-1/2d})$ rate, where $d$ is the dimension of the context.  ( 2 min )
    Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms. (arXiv:2202.13001v1 [cs.LG])
    We study a sequential decision problem where the learner faces a sequence of $K$-armed stochastic bandit tasks. The tasks may be designed by an adversary, but the adversary is constrained to choose the optimal arm of each task in a smaller (but unknown) subset of $M$ arms. The task boundaries might be known (the bandit meta-learning setting), or unknown (the non-stationary bandit setting), and the number of tasks $N$ as well as the total number of rounds $T$ are known ($N$ could be unknown in the meta-learning setting). We design an algorithm based on a reduction to bandit submodular maximization, and show that its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(N\sqrt{M \tau}+N^{2/3})$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M \tau}+N^{1/2})$ regret.  ( 2 min )
    Primal-Dual Stochastic Mirror Descent for MDPs. (arXiv:2103.00299v4 [math.OC] UPDATED)
    We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs). For this purpose, some variant of Stochastic Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals. An important detail is the ability to use inexact values of functional constraints and compute the value of dual variables. We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method. Using this algorithm, we get the first parallel algorithm for mixing average-reward MDPs with a generative model without reduction to discounted MDP. One of the main features of the presented method is low communication costs in a distributed centralized setting, even with very large networks.  ( 2 min )
    Homeomorphic-Invariance of EM: Non-Asymptotic Convergence in KL Divergence for Exponential Families via Mirror Descent. (arXiv:2011.01170v3 [cs.LG] UPDATED)
    Expectation maximization (EM) is the default algorithm for fitting probabilistic models with missing or latent variables, yet we lack a full understanding of its non-asymptotic convergence properties. Previous works show results along the lines of "EM converges at least as fast as gradient descent" by assuming the conditions for the convergence of gradient descent apply to EM. This approach is not only loose, in that it does not capture that EM can make more progress than a gradient step, but the assumptions fail to hold for textbook examples of EM like Gaussian mixtures. In this work we first show that for the common setting of exponential family distributions, viewing EM as a mirror descent algorithm leads to convergence rates in Kullback-Leibler (KL) divergence. Then, we show how the KL divergence is related to first-order stationarity via Bregman divergences. In contrast to previous works, the analysis is invariant to the choice of parametrization and holds with minimal assumptions. We also show applications of these ideas to local linear (and superlinear) convergence rates, generalized EM, and non-exponential family distributions.  ( 2 min )
    Lower Bounds for Non-Convex Stochastic Optimization. (arXiv:1912.02365v2 [math.OC] UPDATED)
    We lower bound the complexity of finding $\epsilon$-stationary points (with gradient norm at most $\epsilon$) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least $\epsilon^{-4}$ queries to find an $\epsilon$ stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of $\epsilon^{-3}$ queries, establishing the optimality of recently proposed variance reduction techniques.  ( 2 min )
    Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity. (arXiv:2202.13890v1 [cs.LG])
    Offline or batch reinforcement learning seeks to learn a near-optimal policy using history data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has been recently introduced to mitigate high bias of the estimated values. While pessimistic variants of model-based algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their model-free counterparts -- which do not require explicit model estimation -- have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes, and characterize its sample complexity under the single-policy concentrability assumption which does not require the full coverage of the state-action space. In addition, a variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity. Altogether, this work highlights the efficiency of model-free algorithms in offline RL when used in conjunction with pessimism and variance reduction.  ( 2 min )
    Evaluating High-Order Predictive Distributions in Deep Learning. (arXiv:2202.13509v1 [stat.ML])
    Most work on supervised learning research has focused on marginal predictions. In decision problems, joint predictive distributions are essential for good performance. Previous work has developed methods for assessing low-order predictive distributions with inputs sampled i.i.d. from the testing distribution. With low-dimensional inputs, these methods distinguish agents that effectively estimate uncertainty from those that do not. We establish that the predictive distribution order required for such differentiation increases greatly with input dimension, rendering these methods impractical. To accommodate high-dimensional inputs, we introduce \textit{dyadic sampling}, which focuses on predictive distributions associated with random \textit{pairs} of inputs. We demonstrate that this approach efficiently distinguishes agents in high-dimensional examples involving simple logistic regression as well as complex synthetic and empirical data.  ( 2 min )
    On the Benefits of Large Learning Rates for Kernel Methods. (arXiv:2202.13733v1 [stat.ML])
    This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. First observed in the deep learning literature, we show that a phenomenon can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian's eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates may be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why it already occurs in classification tasks without assuming any particular mismatch between train and test data distributions.  ( 2 min )
    Rectified Max-Value Entropy Search for Bayesian Optimization. (arXiv:2202.13597v1 [cs.LG])
    Although the existing max-value entropy search (MES) is based on the widely celebrated notion of mutual information, its empirical performance can suffer due to two misconceptions whose implications on the exploration-exploitation trade-off are investigated in this paper. These issues are essential in the development of future acquisition functions and the improvement of the existing ones as they encourage an accurate measure of the mutual information such as the rectified MES (RMES) acquisition function we develop in this work. Unlike the evaluation of MES, we derive a closed-form probability density for the observation conditioned on the max-value and employ stochastic gradient ascent with reparameterization to efficiently optimize RMES. As a result of a more principled acquisition function, RMES shows a consistent improvement over MES in several synthetic function benchmarks and real-world optimization problems.  ( 2 min )
    The Causal Marginal Polytope for Bounding Treatment Effects. (arXiv:2202.13851v1 [stat.ML])
    Due to unmeasured confounding, it is often not possible to identify causal effects from a postulated model. Nevertheless, we can ask for partial identification, which usually boils down to finding upper and lower bounds of a causal quantity of interest derived from all solutions compatible with the encoded structural assumptions. One appealing way to derive such bounds is by casting it in terms of a constrained optimization method that searches over all causal models compatible with evidence, as introduced in the classic work of Balke and Pearl (1994) for discrete data. Although by construction this guarantees tight bounds, it poses a formidable computational challenge. To cope with this issue, alternatives include algorithms that are not guaranteed to be tight, or by introducing restrictions on the class of models. In this paper, we introduce a novel alternative: inspired by ideas coming from belief propagation, we enforce compatibility between marginals of a causal model and data, without constructing a global causal model. We call this collection of locally consistent marginals the causal marginal polytope. As global independence constraints disappear when considering small dimensional tractable marginals, this also leads to a rethinking of how to elicit and express causal knowledge. We provide an explicit algorithm and implementation of this idea, and assess its practicality with numerical experiments.  ( 2 min )
    Sampling in Dirichlet Process Mixture Models for Clustering Streaming Data. (arXiv:2202.13312v1 [cs.LG])
    Practical tools for clustering streaming data must be fast enough to handle the arrival rate of the observations. Typically, they also must adapt on the fly to possible lack of stationarity; i.e., the data statistics may be time-dependent due to various forms of drifts, changes in the number of clusters, etc. The Dirichlet Process Mixture Model (DPMM), whose Bayesian nonparametric nature allows it to adapt its complexity to the data, seems a natural choice for the streaming-data case. In its classical formulation, however, the DPMM cannot capture common types of drifts in the data statistics. Moreover, and regardless of that limitation, existing methods for online DPMM inference are too slow to handle rapid data streams. In this work we propose adapting both the DPMM and a known DPMM sampling-based non-streaming inference method for streaming-data clustering. We demonstrate the utility of the proposed method on several challenging settings, where it obtains state-of-the-art results while being on par with other methods in terms of speed.  ( 2 min )
    Rule-based Evolutionary Bayesian Learning. (arXiv:2202.13778v1 [stat.ML])
    In our previous work, we introduced the rule-based Bayesian Regression, a methodology that leverages two concepts: (i) Bayesian inference, for the general framework and uncertainty quantification and (ii) rule-based systems for the incorporation of expert knowledge and intuition. The resulting method creates a penalty equivalent to a common Bayesian prior, but it also includes information that typically would not be available within a standard Bayesian context. In this work, we extend the aforementioned methodology with grammatical evolution, a symbolic genetic programming technique that we utilise for automating the rules' derivation. Our motivation is that grammatical evolution can potentially detect patterns from the data with valuable information, equivalent to that of expert knowledge. We illustrate the use of the rule-based Evolutionary Bayesian learning technique by applying it to synthetic as well as real data, and examine the results in terms of point predictions and associated uncertainty.  ( 2 min )
    Hierarchical Gaussian Process Models for Regression Discontinuity/Kink under Sharp and Fuzzy Designs. (arXiv:2110.00921v2 [stat.ML] UPDATED)
    We propose nonparametric Bayesian estimators for causal inference exploiting Regression Discontinuity/Kink (RD/RK) under sharp and fuzzy designs. Our estimators are based on Gaussian Process (GP) regression and classification. The GP methods are powerful probabilistic machine learning approaches that are advantageous in terms of derivative estimation and uncertainty quantification, facilitating RK estimation and inference of RD/RK models. These estimators are extended to hierarchical GP models with an intermediate Bayesian neural network layer and can be characterized as hybrid deep learning models. Monte Carlo simulations show that our estimators perform comparably to and sometimes better than competing estimators in terms of precision, coverage and interval length. The hierarchical GP models considerably improve upon one-layer GP models. We apply the proposed methods to estimate the incumbency advantage of US house elections. Our estimations suggest a significant incumbency advantage in terms of both vote share and probability of winning in the next elections. Lastly we present an extension to accommodate covariate adjustment.  ( 2 min )
    Rethinking Multidimensional Discriminator Output for Generative Adversarial Networks. (arXiv:2109.03378v2 [stat.ML] UPDATED)
    The study of multidimensional discriminator (critic) output for Generative Adversarial Networks has been underexplored in the literature. In this paper, we generalize the Wasserstein GAN framework to take advantage of multidimensional critic output and explore its properties. We also introduce a square-root velocity transformation (SRVT) block which favors training in the multidimensional setting. Proofs of properties are based on our proposed maximal p-centrality discrepancy, which is bounded above by p-Wasserstein distance and fits the Wasserstein GAN framework with multidimensional critic output n. Especially when n = 1 and p = 1, the proposed discrepancy equals 1-Wasserstein distance. Theoretical analysis and empirical evidence show that high-dimensional critic output has its advantage on distinguishing real and fake distributions, and benefits faster convergence and diversity of results.  ( 2 min )
    Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons. (arXiv:2202.13163v1 [stat.ML])
    We consider reinforcement learning (RL) methods in offline domains without additional online data collection, such as mobile health applications. Most of existing policy optimization algorithms in the computer science literature are developed in online settings where data are easy to collect or simulate. Their generalizations to mobile health applications with a pre-collected offline dataset remain unknown. The aim of this paper is to develop a novel advantage learning framework in order to efficiently use pre-collected data for policy optimization. The proposed method takes an optimal Q-estimator computed by any existing state-of-the-art RL algorithms as input, and outputs a new policy whose value is guaranteed to converge at a faster rate than the policy derived based on the initial Q-estimator. Extensive numerical experiments are conducted to back up our theoretical findings.  ( 2 min )
    Scalable Gaussian-process regression and variable selection using Vecchia approximations. (arXiv:2202.12981v1 [stat.ME])
    Gaussian process (GP) regression is a flexible, nonparametric approach to regression that naturally quantifies uncertainty. In many applications, the number of responses and covariates are both large, and a goal is to select covariates that are related to the response. For this setting, we propose a novel, scalable algorithm, coined VGPR, which optimizes a penalized GP log-likelihood based on the Vecchia GP approximation, an ordered conditional approximation from spatial statistics that implies a sparse Cholesky factor of the precision matrix. We traverse the regularization path from strong to weak penalization, sequentially adding candidate covariates based on the gradient of the log-likelihood and deselecting irrelevant covariates via a new quadratic constrained coordinate descent algorithm. We propose Vecchia-based mini-batch subsampling, which provides unbiased gradient estimators. The resulting procedure is scalable to millions of responses and thousands of covariates. Theoretical analysis and numerical studies demonstrate the improved scalability and accuracy relative to existing methods.  ( 2 min )
    Estimating Model Performance on External Samples from Their Limited Statistical Characteristics. (arXiv:2202.13683v1 [stat.ML])
    Methods that address data shifts usually assume full access to multiple datasets. In the healthcare domain, however, privacy-preserving regulations as well as commercial interests limit data availability and, as a result, researchers can typically study only a small number of datasets. In contrast, limited statistical characteristics of specific patient samples are much easier to share and may be available from previously published literature or focused collaborative efforts. Here, we propose a method that estimates model performance in external samples from their limited statistical characteristics. We search for weights that induce internal statistics that are similar to the external ones; and that are closest to uniform. We then use model performance on the weighted internal sample as an estimation for the external counterpart. We evaluate the proposed algorithm on simulated data as well as electronic medical record data for two risk models, predicting complications in ulcerative colitis patients and stroke in women diagnosed with atrial fibrillation. In the vast majority of cases, the estimated external performance is much closer to the actual one than the internal performance. Our proposed method may be an important building block in training robust models and detecting potential model failures in external environments.  ( 2 min )
    Graph Attention Retrospective. (arXiv:2202.13060v1 [cs.LG])
    Graph-based learning is a rapidly growing sub-field of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular type of models is graph attention networks. These models were introduced to allow a node to aggregate information from the features of neighbor nodes in a non-uniform way in contrast to simple graph convolution which does not distinguish the neighbors of a node. In this paper, we study theoretically this expected behaviour of graph attention networks. We prove multiple results on the performance of the graph attention mechanism for the problem of node classification for a contextual stochastic block model. Here the features of the nodes are obtained from a mixture of Gaussians and the edges from a stochastic block model where the features and the edges are coupled in a natural way. First, we show that in an "easy" regime, where the distance between the means of the Gaussians is large enough, graph attention maintains the weights of intra-class edges and significantly reduces the weights of the inter-class edges. As a corollary, we show that this implies perfect node classification independent of the weights of inter-class edges. However, a classical argument shows that in the "easy" regime, the graph is not needed at all to classify the data with high probability. In the "hard" regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. We evaluate our theoretical results on synthetic and real-world data.  ( 2 min )
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v1 [cs.LG] CROSS LISTED)
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the number of total items. To improve the effectiveness for deep learning, we further propose practical strategies by using initial warm-up and stop gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee.  ( 2 min )
    Nonperturbative renormalization for the neural network-QFT correspondence. (arXiv:2108.01403v2 [hep-th] UPDATED)
    In a recent work arXiv:2008.08601, Halverson, Maiti and Stoner proposed a description of neural networks in terms of a Wilsonian effective field theory. The infinite-width limit is mapped to a free field theory, while finite $N$ corrections are taken into account by interactions (non-Gaussian terms in the action). In this paper, we study two related aspects of this correspondence. First, we comment on the concepts of locality and power-counting in this context. Indeed, these usual space-time notions may not hold for neural networks (since inputs can be arbitrary), however, the renormalization group provides natural notions of locality and scaling. Moreover, we comment on several subtleties, for example, that data components may not have a permutation symmetry: in that case, we argue that random tensor field theories could provide a natural generalization. Second, we improve the perturbative Wilsonian renormalization from arXiv:2008.08601 by providing an analysis in terms of the nonperturbative renormalization group using the Wetterich-Morris equation. An important difference with usual nonperturbative RG analysis is that only the effective (IR) 2-point function is known, which requires setting the problem with care. Our aim is to provide a useful formalism to investigate neural networks behavior beyond the large-width limit (i.e.~far from Gaussian limit) in a nonperturbative fashion. A major result of our analysis is that changing the standard deviation of the neural network weight distribution can be interpreted as a renormalization flow in the space of networks. We focus on translations invariant kernels and provide preliminary numerical results.  ( 2 min )
    Learning over No-Preferred and Preferred Sequence of Items for Robust Recommendation (Extended Abstract). (arXiv:2202.13240v1 [cs.IR])
    This paper is an extended version of [Burashnikova et al., 2021, arXiv: 2012.06910], where we proposed a theoretically supported sequential strategy for training a large-scale Recommender System (RS) over implicit feedback, mainly in the form of clicks. The proposed approach consists in minimizing pairwise ranking loss over blocks of consecutive items constituted by a sequence of non-clicked items followed by a clicked one for each user. We present two variants of this strategy where model parameters are updated using either the momentum method or a gradient-based approach. To prevent updating the parameters for an abnormally high number of clicks over some targeted items (mainly due to bots), we introduce an upper and a lower threshold on the number of updates for each user. These thresholds are estimated over the distribution of the number of blocks in the training set. They affect the decision of RS by shifting the distribution of items that are shown to the users. Furthermore, we provide a convergence analysis of both algorithms and demonstrate their practical efficiency over six large-scale collections with respect to various ranking measures.  ( 2 min )
    Provably Efficient Convergence of Primal-Dual Actor-Critic with Nonlinear Function Approximation. (arXiv:2202.13863v1 [cs.LG])
    We study the convergence of the actor-critic algorithm with nonlinear function approximation under a nonconvex-nonconcave primal-dual formulation. Stochastic gradient descent ascent is applied with an adaptive proximal term for robust learning rates. We show the first efficient convergence result with primal-dual actor-critic with a convergence rate of $\mathcal{O}\left(\sqrt{\frac{\ln \left(N d G^2 \right)}{N}}\right)$ under Markovian sampling, where $G$ is the element-wise maximum of the gradient, $N$ is the number of iterations, and $d$ is the dimension of the gradient. Our result is presented with only the Polyak-\L{}ojasiewicz condition for the dual variables, which is easy to verify and applicable to a wide range of reinforcement learning (RL) scenarios. The algorithm and analysis are general enough to be applied to other RL settings, like multi-agent RL. Empirical results on OpenAI Gym continuous control tasks corroborate our theoretical findings.  ( 2 min )
    An Empirical Study on the Intrinsic Privacy of SGD. (arXiv:1912.02919v4 [cs.LG] UPDATED)
    Introducing noise in the training of machine learning systems is a powerful way to protect individual privacy via differential privacy guarantees, but comes at a cost to utility. This work looks at whether the inherent randomness of stochastic gradient descent (SGD) could contribute to privacy, effectively reducing the amount of \emph{additional} noise required to achieve a given privacy guarantee. We conduct a large-scale empirical study to examine this question. Training a grid of over 120,000 models across four datasets (tabular and images) on convex and non-convex objectives, we demonstrate that the random seed has a larger impact on model weights than any individual training example. We test the distribution over weights induced by the seed, finding that the simple convex case can be modelled with a multivariate Gaussian posterior, while neural networks exhibit multi-modal and non-Gaussian weight distributions. By casting convex SGD as a Gaussian mechanism, we then estimate an `intrinsic' data-dependent $\epsilon_i(\mathcal{D})$, finding values as low as 6.3, dropping to 1.9 using empirical estimates. We use a membership inference attack to estimate $\epsilon$ for non-convex SGD and demonstrate that hiding the random seed from the adversary results in a statistically significant reduction in attack performance, corresponding to a reduction in the effective $\epsilon$. These results provide empirical evidence that SGD exhibits appreciable variability relative to its dataset sensitivity, and this `intrinsic noise' has the potential to be leveraged to improve the utility of privacy-preserving machine learning.  ( 2 min )
    Neural net modeling of equilibria in NSTX-U. (arXiv:2202.13915v1 [physics.plasm-ph])
    Neural networks (NNs) offer a path towards synthesizing and interpreting data on faster timescales than traditional physics-informed computational models. In this work we develop two neural networks relevant to equilibrium and shape control modeling, which are part of a suite of tools being developed for the National Spherical Torus Experiment-Upgrade (NSTX-U) for fast prediction, optimization, and visualization of plasma scenarios. The networks include Eqnet, a free-boundary equilibrium solver trained on the EFIT01 reconstruction algorithm, and Pertnet, which is trained on the Gspert code and predicts the non-rigid plasma response, a nonlinear term that arises in shape control modeling. The equilibrium neural network is trained with different combinations of inputs and outputs in order to offer flexibility in use cases. In particular, the NN can use magnetic diagnostics as inputs for equilibrium prediction thus acting as a reconstruction code, or can use profiles and external currents as inputs to act as a traditional free-boundary Grad-Shafranov solver. We report strong performance for both networks indicating that these models could reliably be used within closed-loop simulations. Some limitations regarding generalizability and noise are discussed.  ( 2 min )
    Outlier-Robust Optimal Transport: Duality, Structure, and Statistical Analysis. (arXiv:2111.01361v3 [stat.ML] UPDATED)
    The Wasserstein distance, rooted in optimal transport (OT) theory, is a popular discrepancy measure between probability distributions with various applications to statistics and machine learning. Despite their rich structure and demonstrated utility, Wasserstein distances are sensitive to outliers in the considered distributions, which hinders applicability in practice. We propose a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from each contaminated distribution. Under standard moment assumptions, $\mathsf{W}_p^\varepsilon$ is shown to be minimax optimal for robust estimation under the Huber $\varepsilon$-contamination model. Our formulation of this robust distance amounts to a highly regular optimization problem that lends itself better for analysis compared to previously considered frameworks. Leveraging this, we conduct a thorough theoretical study of $\mathsf{W}_p^\varepsilon$, encompassing robustness guarantees, characterization of optimal perturbations, regularity, duality, and statistical estimation. In particular, by decoupling the optimization variables, we arrive at a simple dual form for $\mathsf{W}_p^\varepsilon$ that can be implemented via an elementary modification to standard, duality-based OT solvers. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.  ( 2 min )
    Benign Underfitting of Stochastic Gradient Descent. (arXiv:2202.13361v1 [cs.LG])
    We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, \emph{without}-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in \emph{any} sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related \emph{with}-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.  ( 2 min )
    Nuances in Margin Conditions Determine Gains in Active Learning. (arXiv:2110.08418v2 [stat.ML] UPDATED)
    We consider nonparametric classification with smooth regression functions, where it is well known that notions of margin in $E[Y|X]$ determine fast or slow rates in both active and passive learning. Here we elucidate a striking distinction between the two settings. Namely, we show that some seemingly benign nuances in notions of margin -- involving the uniqueness of the Bayes classifier, and which have no apparent effect on rates in passive learning -- determine whether or not any active learner can outperform passive learning rates. In particular, for Audibert-Tsybakov's margin condition (allowing general situations with non-unique Bayes classifiers), no active learner can gain over passive learning in commonly studied settings where the marginal on $X$ is near uniform. Our results thus negate the usual intuition from past literature that active rates should improve over passive rates in nonparametric settings.  ( 2 min )
    Multi-Objective Model-based Reinforcement Learning for Infectious Disease Control. (arXiv:2009.04607v3 [cs.LG] UPDATED)
    Severe infectious diseases such as the novel coronavirus (COVID-19) pose a huge threat to public health. Stringent control measures, such as school closures and stay-at-home orders, while having significant effects, also bring huge economic losses. In the face of an emerging infectious disease, a crucial question for policymakers is how to make the trade-off and implement the appropriate interventions timely given the huge uncertainty. In this work, we propose a Multi-Objective Model-based Reinforcement Learning framework to facilitate data-driven decision-making and minimize the overall long-term cost. Specifically, at each decision point, a Bayesian epidemiological model is first learned as the environment model, and then the proposed model-based multi-objective planning algorithm is applied to find a set of Pareto-optimal policies. This framework, combined with the prediction bands for each policy, provides a real-time decision support tool for policymakers. The application is demonstrated with the spread of COVID-19 in China.  ( 2 min )
    Typing assumptions improve identification in causal discovery. (arXiv:2107.10703v2 [cs.LG] UPDATED)
    Causal discovery from observational data is a challenging task that can only be solved up to a set of equivalent solutions, called an equivalence class. Such classes, which are often large in size, encode uncertainties about the orientation of some edges in the causal graph. In this work, we propose a new set of assumptions that constrain possible causal relationships based on the nature of variables, thus circumscribing the equivalence class. Namely, we introduce typed directed acyclic graphs, in which variable types are used to determine the validity of causal relationships. We demonstrate, both theoretically and empirically, that the proposed assumptions can result in significant gains in the identification of the causal graph. We also propose causal discovery algorithms that make use of these assumptions and demonstrate their benefits on simulated and pseudo-real data.  ( 2 min )
    The Computational Asymptotics of Gaussian Variational Inference and the Laplace Aprroximation. (arXiv:2104.05886v2 [stat.CO] UPDATED)
    Gaussian variational inference and the Laplace approximation are popular alternatives to Markov chain Monte Carlo that formulate Bayesian posterior inference as an optimization problem, enabling the use of simple and scalable stochastic optimization algorithms. However, a key limitation of both methods is that the solution to the optimization problem is typically not tractable to compute; even in simple settings the problem is nonconvex. Thus, recently developed statistical guarantees -- which all involve the (data) asymptotic properties of the global optimum -- are not reliably obtained in practice. In this work, we provide two major contributions: a theoretical analysis of the asymptotic convexity properties of variational inference with a Gaussian family and the maximum a posteriori (MAP) problem required by the Laplace approximation; and two algorithms -- consistent Laplace approximation (CLA) and consistent stochastic variational inference (CSVI) -- that exploit these properties to find the optimal approximation in the asymptotic regime. Both CLA and CSVI involve a tractable initialization procedure that finds the local basin of the optimum, and CSVI further includes a scaled gradient descent algorithm that provably stays locally confined to that basin. Experiments on nonconvex synthetic and real-data examples show that compared with standard variational and Laplace approximations, both CSVI and CLA improve the likelihood of obtaining the global optimum of their respective optimization problems.  ( 2 min )
    KL Divergence Estimation with Multi-group Attribution. (arXiv:2202.13576v1 [cs.LG])
    Estimating the Kullback-Leibler (KL) divergence between two distributions given samples from them is well-studied in machine learning and information theory. Motivated by considerations of multi-group fairness, we seek KL divergence estimates that accurately reflect the contributions of sub-populations to the overall divergence. We model the sub-populations coming from a rich (possibly infinite) family $\mathcal{C}$ of overlapping subsets of the domain. We propose the notion of multi-group attribution for $\mathcal{C}$, which requires that the estimated divergence conditioned on every sub-population in $\mathcal{C}$ satisfies some natural accuracy and fairness desiderata, such as ensuring that sub-populations where the model predicts significant divergence do diverge significantly in the two distributions. Our main technical contribution is to show that multi-group attribution can be derived from the recently introduced notion of multi-calibration for importance weights [HKRR18, GRSW21]. We provide experimental evidence to support our theoretical results, and show that multi-group attribution provides better KL divergence estimates when conditioned on sub-populations than other popular algorithms.  ( 2 min )
    Silhouettes and quasi residual plots for neural nets and tree-based classifiers. (arXiv:2106.08814v2 [stat.ML] UPDATED)
    Classification by neural nets and by tree-based methods are powerful tools of machine learning. There exist interesting visualizations of the inner workings of these and other classifiers. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or whether the classifier wants to assign it to different class. This is reflected in the (conditional and posterior) probability of the alternative class (PAC). A high PAC indicates label bias, i.e. the possibility that the case was mislabeled. The PAC is used to construct a silhouette plot which is similar in spirit to the silhouette plot for cluster analysis (Rousseeuw, 1987). The average silhouette width can be used to compare different classifications of the same dataset. We will also draw quasi residual plots of the PAC versus a data feature, which may lead to more insight in the data. One of these data features is how far each case lies from its given class. The graphical displays are illustrated and interpreted on benchmark data sets containing images, mixed features, and tweets.  ( 2 min )

  • Open

    [D] How do you manage your ML projects?
    Since ML project lifecycle is usually longer, I would like to understand what tools do you use to manage your projects in your pipeline such as Project scoping (if you are manager), Data Collection, Feature Engineering, Modeling , Deployment? In my experience, different people involved at different stages so we end up having the information at multiple places such as slack, JIRA etc. submitted by /u/rajg88 [link] [comments]  ( 1 min )
    [D] Mathematics for Machine Learning
    Will high school (A-Level) mathematics be enough to learn machine learning? Or Should I take higher mathematics? I am planning on taking the CS229 course next year, and am trying to build up the required knowledge to get there. ​ For context, I am a high school student with a keen interest in computer science, and machine learning. I am currently in my second to last year of high school, however this will be my last year of high school mathematics (I am an accelerant math student obviously lol). My very unrealistic dream is to attend Stanford University to study computer science. submitted by /u/Global_Border_9289 [link] [comments]  ( 1 min )
    [D] Options for hosting app
    Hello, I've created my first ML model, which is a .h5 file, and connected it to a flask web server. I want to host this so that I am able to make post requests with images for my model to predict from another app. The thing is, I'm not sure how to go about this as my app is 200MB, so too big even to push to Github. Does anyone have any suggestions, please? submitted by /u/Brilliant_Island_935 [link] [comments]  ( 1 min )
    [P] Python fastest auto ARIMA extension: Exogenous variables
    Last week we released the fastest ⚡️ version of auto ARIMA ever made in Python. As the community requested, today we are thrilled to announce that we extended the ARIMA to include exogenous variables, essential for accuracy. 🔥 Our auto.arima is 17x faster than pmdarima and 500x faster than FB-prophet, or something like that; we are still waiting… 😕 ​ https://preview.redd.it/usv467nfjuk81.png?width=1020&format=png&auto=webp&s=ffe858f90c1c78b68708ae43d7b0f5b18b95ed19 Table: Computation time and mean absolute scaled error (mase) vs. seasonal naive, for the day-ahead electricity price forecasting task, of the NordPool, Pennsylvania, Belgium, France, Germany markets using exogenous production of renewable sources, electricity load, and calendar features. Datasets and experiments are available here. Please check it out and give us a star if you like it https://github.com/Nixtla/statsforecast submitted by /u/fedegarzar [link] [comments]  ( 1 min )
    Machine learning tools for single-cell genomics research [D]
    How to utilize ML tools for gene regulatory networks and perturbation predictions: https://www.immunai.com/press/advancements-in-gene-regulatory-networks-immunais-third-symposium submitted by /u/pahita [link] [comments]
    [D] Measures of training convergence
    Hi everyone, Today I was reading the paper Fantastic Generalization Measures and Where to Find Them and stumbled upon the convergence criterion that they use. Specifically, in the paper they say: Convergence criterion is chosen as when cross-entropy loss reaches the value 0.01. I find quite useful to have a hard number (or an order of magnitude) for the loss to decide if the model has converged. My question is: do you know of any other number/rule/criterion to assess the convergenge of a training beside looking at other metrics? Are they useful in your experience? submitted by /u/Happy-Star3710 [link] [comments]  ( 1 min )
    [D] Looking to chat with start-ups that use ML/AI tools!
    Hi, I’m conducting undergraduate research with Carnegie Mellon University’s Human-Computer Interaction Institute and I’m looking to interview ML/AI practitioners at start-up companies. This paper is exploring the challenges that practitioners face at start-ups that may not have the resources of “big tech” companies, how practitioners overcome / deal with these problems, and what tools (real or hypothetical) that they wish they had at their disposal. By “ML/AI practitioner”, I’m referring to anyone who is involved in any state of the ML model prototyping pipeline; this could be data scientists, machine learning engineers, software engineers, quality assurance teams, or customer service. If you’re open to being interviewed as a COMPLETELY ANONYMOUS subject for this IRB-approved study, please comment and/or message me! Thanks in advance! submitted by /u/94theses [link] [comments]  ( 1 min )
    [N] Using Analog computers for AI
    Veritasium posted a great video on the upcoming analog computers being used for neural networks as an alternative to digital computers. https://www.youtube.com/watch?v=GVsUOuSjvcg&ab_channel=Veritasium submitted by /u/Random-Machine [link] [comments]  ( 1 min )
    [P] Open data transformations in Python, no SQL required
    Hi everyone! I created RasgoQL as an open source python package for building dbt-compatible SQL in a pandas-like syntax. It’s already saved me hours of writing CTEs (common table expressions) in SQL, and I hope you’ll give it a try so it can save you time too. ​ You can check it out here: https://github.com/rasgointelligence/RasgoQL submitted by /u/p5256 [link] [comments]  ( 1 min )
    [D] Quitting machine learning for good
    Hi everyone, I'm of those (rare??) persons who does ML for a living but has no interest in doing it. I've built models using classical and deep learning approaches for 7-8 years, and a lot of them had decent impacts in the companies I've worked with. I'm good at what I do and I'm compensated well for it. However, nothing in the field of ML/DL excites me anymore. I find it more enjoyable to solve problems in my math textbooks . In fact, I want a career in which I can do some form of mathematics but I don't want to do machine learning for the rest of my life. Before I shifted to ML for the money, I worked a lot on satellite systems engineering. I also took a lot of physics and EE courses during my masters degree (optics, quantum mechanics and solid state devices). I was thinking of a career in quantum information but I don't have a PhD yet. Also, my computer science skills aren't strong enough to switch to cryptography. Any thoughts / ideas on how to get out of machine learning? submitted by /u/madhav1113 [link] [comments]  ( 4 min )
    [D] Synthetic data for AI among the 10 Breakthrough Technologies 2022 of the MIT Tech Review
    Synthetic datasets are computer-generated samples with the same statistical characteristics as the samples from the original dataset. Synthetic datasets are becoming common to train AIs in areas where real data is scarce or too sensitive to use, as in the case of medical records or personal financial data. I was involved in textual data augmentation for my thesis. submitted by /u/ClaudeCoulombe [link] [comments]  ( 2 min )
    [D] Is regression imputation of missing data logically circular?
    In regression imputation, we are filling in missing features by building a regression model predicting that feature's value from the value of other features. I can't help but feel that this is logically circular. You predict missing feature value and then use those imputed data to build your actual classification or regression model. It seems whatever information those imputed values contain would be already contained in the other features, so you are contributing anything new with imputation? submitted by /u/jucheonsun [link] [comments]  ( 2 min )
  • Open

    Data Science Notebook Life-Hacks I Learned From Ploomber
    Sponsored Post Me, a data scientist, and Jupyter notebooks. Well, our relationship started back then when I began to learn […] The post Data Science Notebook Life-Hacks I Learned From Ploomber appeared first on Machine Learning Mastery.  ( 5 min )
    A Gentle Introduction to Serialization for Python
    Serialization refers to the process of converting a data object (e.g. Python objects, Tensorflow models) into a format that allows […] The post A Gentle Introduction to Serialization for Python appeared first on Machine Learning Mastery.  ( 11 min )
  • Open

    Varicode
    Varicode is a way of encoding text and control characters into binary using code words of variable length. It was developed as part of the PSK31 protocol for digital communication over amateur radio. In the spirit of Morse code, it uses short code words for common characters and longer code words for less common characters […] Varicode first appeared on John D. Cook.  ( 3 min )
    Catenary kiln
    Will Buckner sent me an email with the following question recently. (I’m sharing this with permission.) I am building a kiln using a catenary arch. The rear wall and front wall/door will be vertical and fill in the space under the arch, which has the dimensions of 41″W x 39.5″H. I need the area within […] Catenary kiln first appeared on John D. Cook.  ( 2 min )
  • Open

    DSC Weekly Digest 01 March 2022: Taxonomists Classify, Ontologists Conceptualize
    We’re moving at the Cagle house and we’re discovering that, after eight years of living at the same place, one family can collect a lot of crap. The issue came up, in discussions with my spouse, that my mother-in-law had no sense of organization — which seemed odd because my wife’s mother was the kind… Read More »DSC Weekly Digest 01 March 2022: Taxonomists Classify, Ontologists Conceptualize The post DSC Weekly Digest 01 March 2022: Taxonomists Classify, Ontologists Conceptualize appeared first on Data Science Central.  ( 6 min )
    The Hybrid to Give Your AI the Gift of Knowledge
    Data Points  Hybrid AI is already overcoming the limitations of single technique approaches  Symbolic AI understands actual knowledge, not just data  Hybrid solutions open the “black box” of AI, helping to democratize the technology across the workforce   Symbolic AI and machine learning/deep learning all have their own set of strengths that, when used together in… Read More »The Hybrid to Give Your AI the Gift of Knowledge The post The Hybrid to Give Your AI the Gift of Knowledge appeared first on Data Science Central.  ( 5 min )
    Ten years of Google Knowledge Graph
    It’s been ten years since Google (now a child of holding company Alphabet) coined the term “knowledge graph” and described (in general terms) how their knowledge graph worked. And it’s been over 20 years since Tim Berners-Lee, James Hendler and Ora Lassila published their first article to describe the semantic web they envisioned. Many knowledge… Read More »Ten years of Google Knowledge Graph The post Ten years of Google Knowledge Graph appeared first on Data Science Central.  ( 4 min )
    How to Make Glowing Visualizations – Literally
    I explain here how to do it in Excel. But the principle is general. You can produce this type of visualization with other tools. There are many features in Excel that allow you to generate marketing-style pictures. I never really used them, as the result can be cheesy. If overused, you end up with material… Read More »How to Make Glowing Visualizations – Literally The post How to Make Glowing Visualizations – Literally appeared first on Data Science Central.  ( 3 min )
  • Open

    Mental Model For Deploying RL Models
    I have experience training and deploying ML models, and understand the mechanics. I dabbled in RL a while back but never really understood what it meant to deploy a trained model. All the material I came across was all about the algorithms or training. Can I get a high level overview? Thanks in advance. submitted by /u/santeryl [link] [comments]  ( 1 min )
    Advantages of Discrete Control in Reinforcement Learning?
    I am trying to understand the difference between Discrete and Continuous control algorithms in RL, and from what I've been able to tell it's that Discrete Control methods seem to predict one hot encoded vectors and Continuous control methods output continuous values that can be used to achieve a softmax-like functionality. If this understanding is correct, than what is the advantage of ever using Discrete control algorithms, when you can just use an argmax on the output of a Continuous control algorithm? submitted by /u/uom_questions [link] [comments]  ( 1 min )
    RL agent that does not really improve performance.
    I'm trying to train a small reinforcement learning agent that uses a NN to calculate the Q-value for each action (2 actions). The agent is playing OpenAI's CartPole game. I don't really get it to learn anything/improve the score. Does anyone have some experience with RL and would like to give me some hints? This is my code: https://pastebin.com/6VcCHQ07 The Q-value seems to improve at the beggining but the survival time for the game remains as if I were taking random decisions. ​ ​ https://preview.redd.it/13ilr8tj0tk81.png?width=365&format=png&auto=webp&s=7e2360d15d95fa1d9eac1779c8b32f69b613a399 ​ https://preview.redd.it/0fs5glfk0tk81.png?width=376&format=png&auto=webp&s=f9e41d2604e799bdf7efed9e8a19d1a7d272ae5d submitted by /u/LanverYT [link] [comments]  ( 1 min )
    Reward formulation
    Hello ! It may be a stupid question but I´´´´´ am new on RL. The question is : I have a problem that consists to minimize the error of position estimation so I don´`´´´´t know if I should use the standard error -abs(true position - estimated position), in this case sometimes I will have a reward of -0.0003 and sometimes it is in the range of -0.9, that won´t cause a problem during training?. The second question is : the reward must be always in [0, 1] for DDPG or it can be in any range submitted by /u/GuavaAgreeable208 [link] [comments]  ( 2 min )
    Multi-agent bandits with partially shared arms
    Hello guys, I am considering a multi-agent MAB problem where agents have different sets of arms, but some arms are shared. In other words, there is a global set of arms, but each agent only has access to a subset of these arms (which does not necessarily include the best one). I think it could be an interesting setting and looked for related publications in the MAB literature, but without success. Does anyone know about research papers addressing this setting? Thanks submitted by /u/xalendrio [link] [comments]  ( 1 min )
    OpenAI Gym Custom Environment Observation Space returns "None"
    After setting up a custom environment, I was testing whether my observation_space and action_space were properly defined. I was able to call: - env.observation_space and get the properly defined observation_space - env.observation_space.sample() and get a well-working sample Though when calling env.observation_space.shape, I got "None" as a return which doesn't make sense considering I could sample it properly. Looking it up on GitHub / other sites didn't yield any proper help - afaik. The observation_space is a nested tuple, containing only tuples and discrete values. submitted by /u/JonasMArnold [link] [comments]  ( 1 min )
  • Open

    Announcing the AWS DeepRacer League 2022
    Unleash the power of machine learning (ML) through hands-on learning and compete for prizes and glory. The AWS DeepRacer League is the world’s first global autonomous racing competition driven by reinforcement learning; bringing together students, professionals, and enthusiasts from almost every continent. I’m Tomasz Ptak, a senior software engineer at Duco, an AWS Machine Learning […]  ( 5 min )
    Train 175+ billion parameter NLP models with model parallel additions and Hugging Face on Amazon SageMaker
    The last few years have seen rapid development in the field of natural language processing (NLP). While hardware has improved, such as with the latest generation of accelerators from NVIDIA and Amazon, advanced machine learning (ML) practitioners still regularly run into issues scaling their large language models across multiple GPU’s. In this blog post, we […]  ( 9 min )
    ML inferencing at the edge with Amazon SageMaker Edge and Ambarella CV25
    Ambarella builds computer vision SoCs (system on chips) based on a very efficient AI chip architecture and CVflow that provides the Deep Neural Network (DNN) processing required for edge inferencing use cases like intelligent home monitoring and smart surveillance cameras. Developers convert models trained with frameworks (such as TensorFlow or MXNET) to Ambarella CVflow format […]  ( 7 min )
    Anomaly detection with Amazon SageMaker Edge Manager using AWS IoT Greengrass V2
    Deploying and managing machine learning (ML) models at the edge requires a different set of tools and skillsets as compared to the cloud. This is primarily due to the hardware, software, and networking restrictions at the edge sites. This makes deploying and managing these models more complex. An increasing number of applications, such as industrial […]  ( 13 min )
  • Open

    Co-training Transformer with Videos and Images Improves Action Recognition
    Posted by Bowen Zhang, Student Researcher and Jiahui Yu, Senior Research Scientist, Google Research, Brain Team Action recognition has become a major focus area for the research community because many applications can benefit from improved modeling, such as video retrieval, video captioning, video question-answering, etc. Transformer-based approaches have recently demonstrated state-of-the-art performance on several benchmarks. While Transformer models require data to learn better visual priors compared to ConvNets, action recognition datasets are relatively small in scale. Large Transformer models are typically first trained on image datasets and later fine-tuned on a target action recognition dataset. While the current pre-training and fine-tuning action recognition paradigm is stra…  ( 7 min )
  • Open

    Neural network without hidden layers
    Would a neural network with only input and output units (and nonlinear activation function at the output node) still be considered a neural network? Would the nonlinearity of the activation function allow what would be just a polynomial to be considered in the real of "artificial neural networks?" submitted by /u/Cogitarius [link] [comments]  ( 1 min )
    Keras throwing a labels shape mismatch error
    I am building a multiclass segmentation model using DeepLapv3+ and ResNet50. I started off with this tutorial but altered much of the code for my use case. https://keras.io/examples/vision/deeplabv3_plus/ In this block, I am processing my data: # CIHP has 20 labels and Headsegmentation has 14 labels image_size = 512 batch = 4 labels = 14 data_directory = "/content/headsegmentation_final/" sample_train_images = len(os.listdir(data_directory + 'Training/Images/')) - 1 sample_validation_images = len(os.listdir(data_directory + 'Validation/Images/')) - 1 print('Train size: ' + str(sample_train_images)) print('Validation size: ' + str(sample_validation_images)) t_images = sorted(glob(os.path.join(data_directory, "Training/Images/*")))[:sample_train_images] t_masks = sorted(glob(os.path.join(…  ( 2 min )
  • Open

    Hey Folks, A team of researchers at Meta AI is working on developing language and machine translation capabilities that will cover most of the world’s languages. Their work includes two projects: No Language Left Behind and Universal Speech Translator.
    A team of researchers at Meta AI is working on developing language and machine translation capabilities that will cover most of the world’s languages. Their work includes two projects: No Language Left Behind, a new AI model that can be trained to learn languages with fewer examples allowing expert-quality translations in hundreds of languages, from Asturian to Luganda to Urdu. The second project is Universal Speech Translator, which includes unique methods for translating from one language’s speech to another in real-time. This will enable models to support languages that do not have a regular writing system. You can read my full summary of this research here or even check out Meta's Blog https://preview.redd.it/sd0woclausk81.png?width=1508&format=png&auto=webp&s=93e4d4a47b76648456f2511b43084c25a5e8c177 submitted by /u/techsucker [link] [comments]  ( 1 min )
    Injecting fairness into machine-learning models
    submitted by /u/qptbook [link] [comments]
    Artificial Intelligence News
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Fingertip sensitivity for robots: A soft thumb-sized vision-based sensor with accurate all-round force perception
    submitted by /u/leonardo-vinci [link] [comments]
    Mark Zuckerberg Describes the Evolution of Facebook's AI System for Content Moderation & Why 2016 Was a Turning Point (short audio clip)
    submitted by /u/frog9913 [link] [comments]
    Teaching an AI How to KILL.... (in a video game)
    submitted by /u/Ziinxx [link] [comments]
    AI that generates math/physic equations/theories?
    Can something like that be done? For example, the Kaluza-Klein theory was able to derive various existing physic equations using an 5 by 5 matrix. My idea comes from the first 4 minutes of this video https://www.youtube.com/watch?v=mmtLgYVEuJs submitted by /u/netblazer [link] [comments]  ( 1 min )
    In a Latest Research From MIT and Harvard, The Research Team Finds When and How a Machine-Learning Model is Capable of Overcoming Dataset Bias
    It is no doubt that machine learning has become an inseparable part of our daily lives. Today, ML algorithms recommend movies, goods to buy, and even someone to date. They demonstrate the advantages of algorithmic decision-making. Unlike individuals, machines do not get weary or bored, and they can consider orders of magnitude more factors than humans. However, it is difficult to understand how ML models make decisions, and they are often biased. If the datasets are used to train machine-learning algorithms to contain biassed data, the system is likely to make decisions with the same bias in practice. ML fairness is a relatively new branch of machine learning that looks at methods to ensure that data biases and model flaws don’t lead to models discriminating against people based on their ethnicity, gender, handicap, or sexual or political orientation. Feel free to read the full summary here or check out the paper https://preview.redd.it/v4sjb2uhdpk81.png?width=1536&format=png&auto=webp&s=0d069681ac719185b262b4990b0ca671db8e3ba0 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Day-to-day tasks as a planning problem
    AI is not alien for to-do applications. Microsoft has products/plugins that use AI-powered task suggestions. Although, it seems like they are keyword based, which then becomes an NLP problem. Since a daily schedule has inherent sequence in it, I was wondering if it makes sense modeling a workday as a planning problem, say, describe it as an MDP. Why? Maybe it can explain/assist in productivity in a broader sense. Do you all think this train of thought has any merit? submitted by /u/martiandrongo [link] [comments]  ( 1 min )
  • Open

    What Is GauGAN? How AI Turns Your Words and Pictures Into Stunning Art
    GauGAN, an AI demo for photorealistic image generation, allows anyone to create stunning landscapes using generative adversarial networks. Named after post-Impressionist painter Paul Gauguin, it was created by NVIDIA Research and can be experienced free through NVIDIA AI Demos. How to Create With GauGAN The latest version of the demo, GauGAN2, turns any combination of Read article > The post What Is GauGAN? How AI Turns Your Words and Pictures Into Stunning Art appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Study examines how machine learning boosts manufacturing
    Researchers surveyed 100 high-performing companies to determine which of them are leading adopters of machine intelligence and data analytics, and how they succeed.  ( 5 min )
    Injecting fairness into machine-learning models
    A new technique boosts models’ ability to reduce bias, even if the dataset used to train the model is unbalanced.  ( 7 min )
  • Open

    Non-parametric Kernel-Based Estimation of Probability Distributions for Precipitation Modeling. (arXiv:2109.09961v2 [stat.AP] UPDATED)
    The probability distribution of precipitation amount strongly depends on geography, climate zone, and time scale considered. Closed-form parametric probability distributions are not sufficiently flexible to provide accurate and universal models for precipitation amount over different time scales. In this paper we derive non-parametric estimates of the cumulative distribution function (CDF) of precipitation amount for wet periods. The CDF estimates are obtained by integrating the kernel density estimator leading to semi-explicit CDF expressions for different kernel functions. We investigate an adaptive plug-in bandwidth (KCDE), using both synthetic data sets and reanalysis precipitation data from the Mediterranean island of Crete (Greece). We show that KCDE provides better estimates of the probability distribution than the standard empirical (staircase) estimate and kernel-based estimates that use the normal reference bandwidth. We also demonstrate that KCDE enables the simulation of non-parametric precipitation amount distributions by means of the inverse transform sampling method.  ( 2 min )
    Federated Self-Training for Semi-Supervised Audio Recognition. (arXiv:2107.06877v2 [cs.LG] UPDATED)
    Federated Learning is a distributed machine learning paradigm dealing with decentralized and personal datasets. Since data reside on devices like smartphones and virtual assistants, labeling is entrusted to the clients, or labels are extracted in an automated way. Specifically, in the case of audio data, acquiring semantic annotations can be prohibitively expensive and time-consuming. As a result, an abundance of audio data remains unlabeled and unexploited on users' devices. Most existing federated learning approaches focus on supervised learning without harnessing the unlabeled data. In this work, we study the problem of semi-supervised learning of audio models via self-training in conjunction with federated learning. We propose FedSTAR to exploit large-scale on-device unlabeled data to improve the generalization of audio recognition models. We further demonstrate that self-supervised pre-trained models can accelerate the training of on-device models, significantly improving convergence to within fewer training rounds. We conduct experiments on diverse public audio classification datasets and investigate the performance of our models under varying percentages of labeled and unlabeled data. Notably, we show that with as little as 3% labeled data available, FedSTAR on average can improve the recognition rate by 13.28% compared to the fully supervised federated model.  ( 2 min )
    KerGNNs: Interpretable Graph Neural Networks with Graph Kernels. (arXiv:2201.00491v2 [cs.LG] UPDATED)
    Graph kernels are historically the most widely-used technique for graph classification tasks. However, these methods suffer from limited performance because of the hand-crafted combinatorial features of graphs. In recent years, graph neural networks (GNNs) have become the state-of-the-art method in downstream graph-related tasks due to their superior performance. Most GNNs are based on Message Passing Neural Network (MPNN) frameworks. However, recent studies show that MPNNs can not exceed the power of the Weisfeiler-Lehman (WL) algorithm in graph isomorphism test. To address the limitations of existing graph kernel and GNN methods, in this paper, we propose a novel GNN framework, termed \textit{Kernel Graph Neural Networks} (KerGNNs), which integrates graph kernels into the message passing process of GNNs. Inspired by convolution filters in convolutional neural networks (CNNs), KerGNNs adopt trainable hidden graphs as graph filters which are combined with subgraphs to update node embeddings using graph kernels. In addition, we show that MPNNs can be viewed as special cases of KerGNNs. We apply KerGNNs to multiple graph-related tasks and use cross-validation to make fair comparisons with benchmarks. We show that our method achieves competitive performance compared with existing state-of-the-art methods, demonstrating the potential to increase the representation ability of GNNs. We also show that the trained graph filters in KerGNNs can reveal the local graph structures of the dataset, which significantly improves the model interpretability compared with conventional GNN models.  ( 2 min )
    Mixture Model Auto-Encoders: Deep Clustering through Dictionary Learning. (arXiv:2110.04683v2 [cs.LG] UPDATED)
    State-of-the-art approaches for clustering high-dimensional data utilize deep auto-encoder architectures. Many of these networks require a large number of parameters and suffer from a lack of interpretability, due to the black-box nature of the auto-encoders. We introduce Mixture Model Auto-Encoders (MixMate), a novel architecture that clusters data by performing inference on a generative model. Derived from the perspective of sparse dictionary learning and mixture models, MixMate comprises several auto-encoders, each tasked with reconstructing data in a distinct cluster, while enforcing sparsity in the latent space. Through experiments on various image datasets, we show that MixMate achieves competitive performance compared to state-of-the-art deep clustering algorithms, while using orders of magnitude fewer parameters.  ( 2 min )
    Convex Analysis of the Mean Field Langevin Dynamics. (arXiv:2201.10469v2 [stat.ML] UPDATED)
    As an example of the nonlinear Fokker-Planck equation, the mean field Langevin dynamics recently attracts attention due to its connection to (noisy) gradient descent on infinitely wide neural networks in the mean field regime, and hence the convergence property of the dynamics is of great theoretical interest. In this work, we give a concise and self-contained convergence rate analysis of the mean field Langevin dynamics with respect to the (regularized) objective function in both continuous and discrete time settings. The key ingredient of our proof is a proximal Gibbs distribution $p_q$ associated with the dynamics, which, in combination with techniques in [Vempala and Wibisono (2019)], allows us to develop a simple convergence theory parallel to classical results in convex optimization. Furthermore, we reveal that $p_q$ connects to the duality gap in the empirical risk minimization setting, which enables efficient empirical evaluation of the algorithm convergence.  ( 2 min )
    Long Expressive Memory for Sequence Modeling. (arXiv:2110.04744v2 [cs.LG] UPDATED)
    We propose a novel method called Long Expressive Memory (LEM) for learning long-term sequential dependencies. LEM is gradient-based, it can efficiently process sequential tasks with very long-term dependencies, and it is sufficiently expressive to be able to learn complicated input-output maps. To derive LEM, we consider a system of multiscale ordinary differential equations, as well as a suitable time-discretization of this system. For LEM, we derive rigorous bounds to show the mitigation of the exploding and vanishing gradients problem, a well-known challenge for gradient-based recurrent sequential learning methods. We also prove that LEM can approximate a large class of dynamical systems to high accuracy. Our empirical results, ranging from image and time-series classification through dynamical systems prediction to speech recognition and language modeling, demonstrate that LEM outperforms state-of-the-art recurrent neural networks, gated recurrent units, and long short-term memory models.  ( 2 min )
    A Generalized Weighted Optimization Method for Computational Learning and Inversion. (arXiv:2201.09223v2 [cs.LG] UPDATED)
    The generalization capacity of various machine learning models exhibits different phenomena in the under- and over-parameterized regimes. In this paper, we focus on regression models such as feature regression and kernel regression and analyze a generalized weighted least-squares optimization method for computational learning and inversion with noisy data. The highlight of the proposed framework is that we allow weighting in both the parameter space and the data space. The weighting scheme encodes both a priori knowledge on the object to be learned and a strategy to weight the contribution of different data points in the loss function. Here, we characterize the impact of the weighting scheme on the generalization error of the learning method, where we derive explicit generalization errors for the random Fourier feature model in both the under- and over-parameterized regimes. For more general feature maps, error bounds are provided based on the singular values of the feature matrix. We demonstrate that appropriate weighting from prior knowledge can improve the generalization capability of the learned model.  ( 2 min )
    AEVA: Black-box Backdoor Detection Using Adversarial Extreme Value Analysis. (arXiv:2110.14880v4 [cs.LG] UPDATED)
    Deep neural networks (DNNs) are proved to be vulnerable against backdoor attacks. A backdoor is often embedded in the target DNNs through injecting a backdoor trigger into training examples, which can cause the target DNNs misclassify an input attached with the backdoor trigger. Existing backdoor detection methods often require the access to the original poisoned training data, the parameters of the target DNNs, or the predictive confidence for each given input, which are impractical in many real-world applications, e.g., on-device deployed DNNs. We address the black-box hard-label backdoor detection problem where the DNN is fully black-box and only its final output label is accessible. We approach this problem from the optimization perspective and show that the objective of backdoor detection is bounded by an adversarial objective. Further theoretical and empirical studies reveal that this adversarial objective leads to a solution with highly skewed distribution; a singularity is often observed in the adversarial map of a backdoor-infected example, which we call the adversarial singularity phenomenon. Based on this observation, we propose the adversarial extreme value analysis(AEVA) to detect backdoors in black-box neural networks. AEVA is based on an extreme value analysis of the adversarial map, computed from the monte-carlo gradient estimation. Evidenced by extensive experiments across multiple popular tasks and backdoor attacks, our approach is shown effective in detecting backdoor attacks under the black-box hard-label scenarios.  ( 2 min )
    Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words. (arXiv:2110.15909v2 [cs.LG] UPDATED)
    We investigate the performance on phoneme categorization and phoneme and word segmentation of several self-supervised learning (SSL) methods based on Contrastive Predictive Coding (CPC). Our experiments show that with the existing algorithms there is a trade off between categorization and segmentation performance. We investigate the source of this conflict and conclude that the use of context building networks, albeit necessary for superior performance on categorization tasks, harms segmentation performance by causing a temporal shift on the learned representations. Aiming to bridge this gap, we take inspiration from the leading approach on segmentation, which simultaneously models the speech signal at the frame and phoneme level, and incorporate multi-level modelling into Aligned CPC (ACPC), a variation of CPC which exhibits the best performance on categorization tasks. Our multi-level ACPC (mACPC) improves in all categorization metrics and achieves state-of-the-art performance in word segmentation.
    Heavy-tailed Streaming Statistical Estimation. (arXiv:2108.11483v2 [cs.LG] UPDATED)
    We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples. This could also be viewed as stochastic optimization under heavy-tailed distributions, with an additional $O(p)$ space complexity constraint. We design a clipped stochastic gradient descent algorithm and provide an improved analysis, under a more nuanced condition on the noise of the stochastic gradients, which we show is critical when analyzing stochastic optimization problems arising from general statistical estimation problems. Our results guarantee convergence not just in expectation but with exponential concentration, and moreover does so using $O(1)$ batch size. We provide consequences of our results for mean estimation and linear regression. Finally, we provide empirical corroboration of our results and algorithms via synthetic experiments for mean estimation and linear regression.
    Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement. (arXiv:2010.08532v2 [cs.LG] UPDATED)
    Fine-tuning deep neural networks pre-trained on large scale datasets is one of the most practical transfer learning paradigm given limited quantity of training samples. To obtain better generalization, using the starting point as the reference (SPAR), either through weights or features, has been successfully applied to transfer learning as a regularizer. However, due to the domain discrepancy between the source and target task, there exists obvious risk of negative transfer in a straightforward manner of knowledge preserving. In this paper, we propose a novel transfer learning algorithm, introducing the idea of Target-awareness REpresentation Disentanglement (TRED), where the relevant knowledge with respect to the target task is disentangled from the original source model and used as a regularizer during fine-tuning the target model. Specifically, we design two alternative methods, maximizing the Maximum Mean Discrepancy (Max-MMD) and minimizing the mutual information (Min-MI), for the representation disentanglement. Experiments on various real world datasets show that our method stably improves the standard fine-tuning by more than 2% in average. TRED also outperforms related state-of-the-art transfer learning regularizers such as L2-SP, AT, DELTA, and BSS.
    Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines. (arXiv:2107.06925v3 [cs.DC] UPDATED)
    Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore no loss of accuracy, which is more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera has a more balanced activation memory consumption. Evaluations are conducted on Transformer based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
    AutoFR: Automated Filter Rule Generation for Adblocking. (arXiv:2202.12872v1 [cs.LG])
    Adblocking relies on filter lists, which are manually curated and maintained by a small community of filter list authors. This manual process is laborious and does not scale well to a large number of sites and over time. We introduce AutoFR, a reinforcement learning framework to fully automate the process of filter rule creation and evaluation. We design an algorithm based on multi-arm bandits to generate filter rules while controlling the trade-off between blocking ads and avoiding breakage. We test our implementation of AutoFR on thousands of sites in terms of efficiency and effectiveness. AutoFR is efficient: it takes only a few minutes to generate filter rules for a site. AutoFR is also effective: it generates filter rules that can block 86% of the ads, as compared to 87% by EasyList while achieving comparable visual breakage. The filter rules generated by AutoFR generalize well to new and unseen sites. We envision AutoFR to assist the adblocking community in automated filter rule generation at scale.
    Reinforcement Learning for Selective Key Applications in Power Systems: Recent Advances and Future Challenges. (arXiv:2102.01168v5 [cs.LG] UPDATED)
    With large-scale integration of renewable generation and distributed energy resources, modern power systems are confronted with new operational challenges, such as growing complexity, increasing uncertainty, and aggravating volatility. Meanwhile, more and more data are becoming available owing to the widespread deployment of smart meters, smart sensors, and upgraded communication networks. As a result, data-driven control techniques, especially reinforcement learning (RL), have attracted surging attention in recent years. This paper provides a comprehensive review of various RL techniques and how they can be applied to decision-making and control in power systems. In particular, we select three key applications, i.e., frequency regulation, voltage control, and energy management, as examples to illustrate RL-based models and solutions. We then present the critical issues in the application of RL, i.e., safety, robustness, scalability, and data. Several potential future directions are discussed as well.
    Hybrid and Automated Machine Learning Approaches for Oil Fields Development: the Case Study of Volve Field, North Sea. (arXiv:2103.02598v3 [cs.LG] UPDATED)
    The paper describes the usage of intelligent approaches for field development tasks that may assist a decision-making process. We focused on the problem of wells location optimization and two tasks within it: improving the quality of oil production estimation and estimation of reservoir characteristics for appropriate wells allocation and parametrization, using machine learning methods. For oil production estimation, we implemented and investigated the quality of forecasting models: physics-based, pure data-driven, and hybrid one. The CRMIP model was chosen as a physics-based approach. We compare it with the machine learning and hybrid methods in a frame of oil production forecasting task. In the investigation of reservoir characteristics for wells location choice, we automated the seismic analysis using evolutionary identification of convolutional neural network for the reservoir detection. The Volve oil field dataset was used as a case study to conduct the experiments. The implemented approaches can be used to analyze different oil fields or adapted to similar physics-related problems.
    Reduced operator inference for nonlinear partial differential equations. (arXiv:2102.00083v2 [math.NA] UPDATED)
    We present a new scientific machine learning method that learns from data a computationally inexpensive surrogate model for predicting the evolution of a system governed by a time-dependent nonlinear partial differential equation (PDE), an enabling technology for many computational algorithms used in engineering settings. Our formulation generalizes to the function space PDE setting the Operator Inference method previously developed in [B. Peherstorfer and K. Willcox, Data-driven operator inference for non-intrusive projection-based model reduction, Computer Methods in Applied Mechanics and Engineering, 306 (2016)] for systems governed by ordinary differential equations. The method brings together two main elements. First, ideas from projection-based model reduction are used to explicitly parametrize the learned model by low-dimensional polynomial operators which reflect the known form of the governing PDE. Second, supervised machine learning tools are used to infer from data the reduced operators of this physics-informed parametrization. For systems whose governing PDEs contain more general (non-polynomial) nonlinearities, the learned model performance can be improved through the use of lifting variable transformations, which expose polynomial structure in the PDE. The proposed method is demonstrated on two examples: a heat equation model problem that demonstrates the benefits of the function space formulation in terms of consistency with the underlying continuous truth, and a three-dimensional combustion simulation with over 18 million degrees of freedom, for which the learned reduced models achieve accurate predictions with a dimension reduction of five orders of magnitude and model runtime reduction of up to nine orders of magnitude.
    A function space analysis of finite neural networks with insights from sampling theory. (arXiv:2004.06989v2 [cs.LG] UPDATED)
    This work suggests using sampling theory to analyze the function space represented by neural networks. First, it shows, under the assumption of a finite input domain, which is the common case in training neural networks, that the function space generated by multi-layer networks with non-expansive activation functions is smooth. This extends over previous works that show results for the case of infinite width ReLU networks. Then, under the assumption that the input is band-limited, we provide novel error bounds for univariate neural networks. We analyze both deterministic uniform and random sampling showing the advantage of the former.
    Meta-Learning for Simple Regret Minimization. (arXiv:2202.12888v1 [cs.LG])
    We develop a meta-learning framework for simple regret minimization in bandits. In this framework, a learning agent interacts with a sequence of bandit tasks, which are sampled i.i.d.\ from an unknown prior distribution, and learns its meta-parameters to perform better on future tasks. We propose the first Bayesian and frequentist algorithms for this meta-learning problem. The Bayesian algorithm has access to a prior distribution over the meta-parameters and its meta simple regret over $m$ bandit tasks with horizon $n$ is mere $\tilde{O}(m / \sqrt{n})$. This is while we show that the meta simple regret of the frequentist algorithm is $\tilde{O}(\sqrt{m} n + m/ \sqrt{n})$, and thus, worse. However, the algorithm is more general, because it does not need a prior distribution over the meta-parameters, and is easier to implement for various distributions. We instantiate our algorithms for several classes of bandit problems. Our algorithms are general and we complement our theory by evaluating them empirically in several environments.
    Dynamic Regret of Online Mirror Descent for Relatively Smooth Convex Cost Functions. (arXiv:2202.12843v1 [cs.LG])
    The performance of online convex optimization algorithms in a dynamic environment is often expressed in terms of the dynamic regret, which measures the decision maker's performance against a sequence of time-varying comparators. In the analysis of the dynamic regret, prior works often assume Lipschitz continuity or uniform smoothness of the cost functions. However, there are many important cost functions in practice that do not satisfy these conditions. In such cases, prior analyses are not applicable and fail to guarantee the optimization performance. In this letter, we show that it is possible to bound the dynamic regret, even when neither Lipschitz continuity nor uniform smoothness is present. We adopt the notion of relative smoothness with respect to some user-defined regularization function, which is a much milder requirement on the cost functions. We first show that under relative smoothness, the dynamic regret has an upper bound based on the path length and functional variation. We then show that with an additional condition of relatively strong convexity, the dynamic regret can be bounded by the path length and gradient variation. These regret bounds provide performance guarantees to a wide variety of online optimization problems that arise in different application domains. Finally, we present numerical experiments that demonstrate the advantage of adopting a regularization function under which the cost functions are relatively smooth.
    Biological error correction codes generate fault-tolerant neural networks. (arXiv:2202.12887v1 [cs.LG])
    It has been an open question in deep learning if fault-tolerant computation is possible: can arbitrarily reliable computation be achieved using only unreliable neurons? In the mammalian cortex, analog error correction codes known as grid codes have been observed to protect states against neural spiking noise, but their role in information processing is unclear. Here, we use these biological codes to show that a universal fault-tolerant neural network can be achieved if the faultiness of each neuron lies below a sharp threshold, which we find coincides in order of magnitude with noise observed in biological neurons. The discovery of a sharp phase transition from faulty to fault-tolerant neural computation opens a path towards understanding noisy analog systems in artificial intelligence and neuroscience.
    A Hybrid Spatial-temporal Deep Learning Architecture for Lane Detection. (arXiv:2110.04079v3 [cs.CV] UPDATED)
    Accurate and reliable lane detection is vital for the safe performance of lane-keeping assistance and lane departure warning systems. However, under certain challenging circumstances, it is difficult to get satisfactory performance in accurately detecting the lanes from one single image as mostly done in current literature. Since lane markings are continuous lines, the lanes that are difficult to be accurately detected in the current single image can potentially be better deduced if information from previous frames is incorporated. This study proposes a novel hybrid spatial-temporal (ST) sequence-to-one deep learning architecture. This architecture makes full use of the ST information in multiple continuous image frames to detect the lane markings in the very last frame. Specifically, the hybrid model integrates the following aspects: (a) the single image feature extraction module equipped with the spatial convolutional neural network; (b) the ST feature integration module constructed by ST recurrent neural network; (c) the encoder-decoder structure, which makes this image segmentation problem work in an end-to-end supervised learning format. Extensive experiments reveal that the proposed model architecture can effectively handle challenging driving scenes and outperforms available state-of-the-art methods.
    Tensor-Train Density Estimation. (arXiv:2108.00089v2 [cs.LG] UPDATED)
    Estimation of probability density function from samples is one of the central problems in statistics and machine learning. Modern neural network-based models can learn high dimensional distributions but have problems with hyperparameter selection and are often prone to instabilities during training and inference. We propose a new efficient tensor train-based model for density estimation (TTDE). Such density parametrization allows exact sampling, calculation of cumulative and marginal density functions, and partition function. It also has very intuitive hyperparameters. We develop an efficient non-adversarial training procedure for TTDE based on the Riemannian optimization. Experimental results demonstrate the competitive performance of the proposed method in density estimation and sampling tasks, while TTDE significantly outperforms competitors in training speed.
    Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects. (arXiv:2202.12891v1 [stat.ML])
    Estimating heterogeneous treatment effects is an important problem across many domains. In order to accurately estimate such treatment effects, one typically relies on data from observational studies or randomized experiments. Currently, most existing works rely exclusively on observational data, which is often confounded and, hence, yields biased estimates. While observational data is confounded, randomized data is unconfounded, but its sample size is usually too small to learn heterogeneous treatment effects. In this paper, we propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data via representation learning. In particular, we introduce a two-step framework: first, we use observational data to learn a shared structure (in form of a representation); and then, we use randomized data to learn the data-specific structures. We analyze the finite sample properties of our framework and compare them to several natural baselines. As such, we derive conditions for when combining observational and randomized data is beneficial, and for when it is not. Based on this, we introduce a sample-efficient algorithm, called CorNet. We use extensive simulation studies to verify the theoretical properties of CorNet and multiple real-world datasets to demonstrate our method's superiority compared to existing methods.
    Robust fine-tuning of zero-shot models. (arXiv:2109.01903v2 [cs.CV] UPDATED)
    Large pre-trained models such as CLIP or ALIGN offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning methods substantially improve accuracy on a given target distribution, they often reduce robustness to distribution shifts. We address this tension by introducing a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Compared to standard fine-tuning, WiSE-FT provides large accuracy improvements under distribution shift, while preserving high accuracy on the target distribution. On ImageNet and five derived distribution shifts, WiSE-FT improves accuracy under distribution shift by 4 to 6 percentage points (pp) over prior work while increasing ImageNet accuracy by 1.6 pp. WiSE-FT achieves similarly large robustness gains (2 to 23 pp) on a diverse set of six further distribution shifts, and accuracy gains of 0.8 to 3.3 pp compared to standard fine-tuning on seven commonly used transfer learning datasets. These improvements come at no additional computational cost during fine-tuning or inference.
    Graph Neural Networks: Taxonomy, Advances and Trends. (arXiv:2012.08752v4 [cs.LG] UPDATED)
    Graph neural networks provide a powerful toolkit for embedding real-world graphs into low-dimensional spaces according to specific tasks. Up to now, there have been several surveys on this topic. However, they usually lay emphasis on different angles so that the readers can not see a panorama of the graph neural networks. This survey aims to overcome this limitation, and provide a comprehensive review on the graph neural networks. First of all, we provide a novel taxonomy for the graph neural networks, and then refer to up to 400 relevant literatures to show the panorama of the graph neural networks. All of them are classified into the corresponding categories. In order to drive the graph neural networks into a new stage, we summarize four future research directions so as to overcome the facing challenges. It is expected that more and more scholars can understand and exploit the graph neural networks, and use them in their research community.
    Test Score Algorithms for Budgeted Stochastic Utility Maximization. (arXiv:2012.15194v3 [cs.DS] UPDATED)
    Motivated by recent developments in designing algorithms based on individual item scores for solving utility maximization problems, we study the framework of using test scores, defined as a statistic of observed individual item performance data, for solving the budgeted stochastic utility maximization problem. We extend an existing scoring mechanism, namely the replication test scores, to incorporate heterogeneous item costs as well as item values. We show that a natural greedy algorithm that selects items solely based on their replication test scores outputs solutions within a constant factor of the optimum for a broad class of utility functions. Our algorithms and approximation guarantees assume that test scores are noisy estimates of certain expected values with respect to marginal distributions of individual item values, thus making our algorithms practical and extending previous work that assumes noiseless estimates. Moreover, we show how our algorithm can be adapted to the setting where items arrive in a streaming fashion while maintaining the same approximation guarantee. We present numerical results, using synthetic data and data sets from the Academia.StackExchange Q&A forum, which show that our test score algorithm can achieve competitiveness, and in some cases better performance than a benchmark algorithm that requires access to a value oracle to evaluate function values.
    An initial alignment between neural network and target is needed for gradient descent to learn. (arXiv:2202.12846v1 [cs.LG])
    This paper introduces the notion of "Initial Alignment" (INAL) between a neural network at initialization and a target function. It is proved that if a network and target function do not have a noticeable INAL, then noisy gradient descent on a fully connected network with normalized i.i.d. initialization will not learn in polynomial time. Thus a certain amount of knowledge about the target (measured by the INAL) is needed in the architecture design. This also provides an answer to an open problem posed in [AS20]. The results are based on deriving lower-bounds for descent algorithms on symmetric neural networks without explicit knowledge of the target function beyond its INAL.
    Continual Learning for Real-World Autonomous Systems: Algorithms, Challenges and Frameworks. (arXiv:2105.12374v2 [cs.LG] UPDATED)
    Continual learning is essential for all real-world applications, as frozen pre-trained models cannot effectively deal with non-stationary data distributions. The purpose of this study is to review the state-of-the-art methods that allow continuous learning of computational models over time. We primarily focus on the learning algorithms that perform continuous learning in an online fashion from considerably large (or infinite) sequential data and require substantially low computational and memory resources. We critically analyze the key challenges associated with continual learning for autonomous real-world systems and compare current methods in terms of computations, memory, and network/model complexity. We also briefly describe the implementations of continuous learning algorithms under three main autonomous systems, i.e., self-driving vehicles, unmanned aerial vehicles, and urban robots. The learning methods of these autonomous systems and their strengths and limitations are extensively explored in this article.
    Behaviorally Grounded Model-Based and Model Free Cost Reduction in a Simulated Multi-Echelon Supply Chain. (arXiv:2202.12786v1 [cs.LG])
    Amplification and phase shift in ordering signals, commonly referred to as bullwhip, are responsible for both excessive strain on real world inventory management systems, stock outs, and unnecessary capital reservation though safety stock building. Bullwhip is a classic, yet persisting, problem with reverberating consequences in inventory management. Research on bullwhip has consistently emphasized behavioral influences for this phenomenon and leveraged behavioral ordering models to suggest interventions. However more recent model-free approaches have also seen success. In this work, the author develops algorithmic approaches towards mitigating bullwhip using both behaviorally grounded model-based approaches alongside a model-free dual deep Q-network reinforcement learning approach. In addition to exploring the utility of this specific model-free architecture to multi-echelon supply chains with imperfect information sharing and information delays, the author directly compares the performance of these model-based and model-free approaches. In doing so, this work highlights both the insights gained from exploring model-based approaches in the context of prior behavioral operations management literature and emphasizes the complementary nature of model-based and model-free approaches in approaching behaviorally grounded supply chain management problems.
    Equilibrium Aggregation: Encoding Sets via Optimization. (arXiv:2202.12795v1 [cs.LG])
    Processing sets or other unordered, potentially variable-sized inputs in neural networks is usually handled by \emph{aggregating} a number of input tensors into a single representation. While a number of aggregation methods already exist from simple sum pooling to multi-head attention, they are limited in their representational power both from theoretical and empirical perspectives. On the search of a principally more powerful aggregation strategy, we propose an optimization-based method called Equilibrium Aggregation. We show that many existing aggregation methods can be recovered as special cases of Equilibrium Aggregation and that it is provably more efficient in some important cases. Equilibrium Aggregation can be used as a drop-in replacement in many existing architectures and applications. We validate its efficiency on three different tasks: median estimation, class counting, and molecular property prediction. In all experiments, Equilibrium Aggregation achieves higher performance than the other aggregation techniques we test.
    Detection as Regression: Certified Object Detection by Median Smoothing. (arXiv:2007.03730v4 [cs.CV] UPDATED)
    Despite the vulnerability of object detectors to adversarial attacks, very few defenses are known to date. While adversarial training can improve the empirical robustness of image classifiers, a direct extension to object detection is very expensive. This work is motivated by recent progress on certified classification by randomized smoothing. We start by presenting a reduction from object detection to a regression problem. Then, to enable certified regression, where standard mean smoothing fails, we propose median smoothing, which is of independent interest. We obtain the first model-agnostic, training-free, and certified defense for object detection against $\ell_2$-bounded attacks. The code for all experiments in the paper is available at this http URL .
    Towards Teachable Autotelic Agents. (arXiv:2105.11977v2 [cs.LG] UPDATED)
    Autonomous discovery and direct instruction are two distinct sources of learning in children but education sciences demonstrate that mixed approaches such as assisted discovery or guided play result in improved skill acquisition. In the field of Artificial Intelligence, these extremes respectively map to autonomous agents learning from their own signals and interactive learning agents fully taught by their teachers. In between should stand teachable autotelic agents (TAA): agents that learn from both internal and teaching signals to benefit from the higher efficiency of assisted discovery. Designing such agents will enable real-world non-expert users to orient the learning trajectories of agents towards their expectations. More fundamentally, this may also be a key step to build agents with human-level intelligence. This paper presents a roadmap towards the design of teachable autonomous agents. Building on developmental psychology and education sciences, we start by identifying key features enabling assisted discovery processes in child-tutor interactions. This leads to the production of a checklist of features that future TAA will need to demonstrate. The checklist allows us to precisely pinpoint the various limitations of current reinforcement learning agents and to identify the promising first steps towards TAA. It also shows the way forward by highlighting key research directions towards the design or autonomous agents that can be taught by ordinary people via natural pedagogy.
    A Systematic Literature Review about Idea Mining: The Use of Machine-driven Analytics to Generate Ideas. (arXiv:2202.12826v1 [cs.IR])
    Idea generation is the core activity of innovation. Digital data sources, which are sources of innovation, such as patents, publications, social media, websites, etc., are increasingly growing at unprecedented volume. Manual idea generation is time-consuming and is affected by the subjectivity of the individuals involved. Therefore, the use machine-driven data analytics techniques to analyze data to generate ideas and support idea generation by serving users is useful. The objective of this study is to study state-of the-art machine-driven analytics for idea generation and data sources, hence the result of this study will generally server as a guideline for choosing techniques and data sources. A systematic literature review is conducted to identify relevant scholarly literature from IEEE, Scopus, Web of Science and Google Scholar. We selected a total of 71 articles and analyzed them thematically. The results of this study indicate that idea generation through machine-driven analytics applies text mining, information retrieval (IR), artificial intelligence (AI), deep learning, machine learning, statistical techniques, natural language processing (NLP), NLP-based morphological analysis, network analysis, and bibliometric to support idea generation. The results include a list of techniques and procedures in idea generation through machine-driven idea analytics. Additionally, characterization and heuristics used in idea generation are summarized. For the future, tools designed to generate ideas could be explored.
    On Monocular Depth Estimation and Uncertainty Quantification using Classification Approaches for Regression. (arXiv:2202.12369v1 [cs.CV])
    Monocular depth is important in many tasks, such as 3D reconstruction and autonomous driving. Deep learning based models achieve state-of-the-art performance in this field. A set of novel approaches for estimating monocular depth consists of transforming the regression task into a classification one. However, there is a lack of detailed descriptions and comparisons for Classification Approaches for Regression (CAR) in the community and no in-depth exploration of their potential for uncertainty estimation. To this end, this paper will introduce a taxonomy and summary of CAR approaches, a new uncertainty estimation solution for CAR, and a set of experiments on depth accuracy and uncertainty quantification for CAR-based models on KITTI dataset. The experiments reflect the differences in the portability of various CAR methods on two backbones. Meanwhile, the newly proposed method for uncertainty estimation can outperform the ensembling method with only one forward propagation.
    Towards Safe, Real-Time Systems: Stereo vs Images and LiDAR for 3D Object Detection. (arXiv:2202.12773v1 [cs.CV])
    As object detectors rapidly improve, attention has expanded past image-only networks to include a range of 3D and multimodal frameworks, especially ones that incorporate LiDAR. However, due to cost, logistics, and even some safety considerations, stereo can be an appealing alternative. Towards understanding the efficacy of stereo as a replacement for monocular input or LiDAR in object detectors, we show that multimodal learning with traditional disparity algorithms can improve image-based results without increasing the number of parameters, and that learning over stereo error can impart similar 3D localization power to LiDAR in certain contexts. Furthermore, doing so also has calibration benefits with respect to image-only methods. We benchmark on the public dataset KITTI, and in doing so, reveal a few small but common algorithmic mistakes currently used in computing metrics on that set, and offer efficient, provably correct alternatives.
    Microgrid Day-Ahead Scheduling Considering Neural Network based Battery Degradation Model. (arXiv:2202.12416v1 [eess.SP])
    Battery energy storage system (BESS) can effectively mitigate the uncertainty of variable renewable generation. Degradation is un-preventable for batteries such as the most popular Lithium-ion battery (LiB). The main causes of LiB degradation are loss of Li-ions, loss of electrolyte, and increase of internal resistance which are hard to model and predict. In this paper, we propose a data driven method to predict the battery degradation per a given scheduled battery operational profile. Particularly, a neural net-work based battery degradation (NNBD) model is proposed to quantify the battery degradation with inputs of major battery degradation factors. When incorporating the proposed NNBD model into microgrid day-ahead scheduling (MDS), we can estab-lish a battery degradation based MDS (BDMDS) model that can consider the equivalent battery degradation cost precisely. Since the proposed NNBD model is highly non-linear and non-convex, BDMDS would be very hard to solve. To address this issue, a neural network and optimization decoupled heuristic (NNODH) algorithm is proposed in this paper to effectively solve this neural network embedded optimization problem. Simulation results demonstrate that the proposed NNODH algorithm is able to ob-tain the optimal solution with lowest total cost including normal operation cost and battery degradation cost.
    Consolidated Adaptive T-soft Update for Deep Reinforcement Learning. (arXiv:2202.12504v1 [cs.LG])
    Demand for deep reinforcement learning (DRL) is gradually increased to enable robots to perform complex tasks, while DRL is known to be unstable. As a technique to stabilize its learning, a target network that slowly and asymptotically matches a main network is widely employed to generate stable pseudo-supervised signals. Recently, T-soft update has been proposed as a noise-robust update rule for the target network and has contributed to improving the DRL performance. However, the noise robustness of T-soft update is specified by a hyperparameter, which should be tuned for each task, and is deteriorated by a simplified implementation. This study develops adaptive T-soft (AT-soft) update by utilizing the update rule in AdaTerm, which has been developed recently. In addition, the concern that the target network does not asymptotically match the main network is mitigated by a new consolidation for bringing the main network back to the target network. This so-called consolidated AT-soft (CAT-soft) update is verified through numerical simulations.
    Building a 3-Player Mahjong AI using Deep Reinforcement Learning. (arXiv:2202.12847v1 [cs.AI])
    Mahjong is a popular multi-player imperfect-information game developed in China in the late 19th-century, with some very challenging features for AI research. Sanma, being a 3-player variant of the Japanese Riichi Mahjong, possesses unique characteristics including fewer tiles and, consequently, a more aggressive playing style. It is thus challenging and of great research interest in its own right, but has not yet been explored. In this paper, we present Meowjong, an AI for Sanma using deep reinforcement learning. We define an informative and compact 2-dimensional data structure for encoding the observable information in a Sanma game. We pre-train 5 convolutional neural networks (CNNs) for Sanma's 5 actions -- discard, Pon, Kan, Kita and Riichi, and enhance the major action's model, namely the discard model, via self-play reinforcement learning using the Monte Carlo policy gradient method. Meowjong's models achieve test accuracies comparable with AIs for 4-player Mahjong through supervised learning, and gain a significant further enhancement from reinforcement learning. Being the first ever AI in Sanma, we claim that Meowjong stands as a state-of-the-art in this game.
    High-Dimensional Sparse Bayesian Learning without Covariance Matrices. (arXiv:2202.12808v1 [eess.SP])
    Sparse Bayesian learning (SBL) is a powerful framework for tackling the sparse coding problem. However, the most popular inference algorithms for SBL become too expensive for high-dimensional settings, due to the need to store and compute a large covariance matrix. We introduce a new inference scheme that avoids explicit construction of the covariance matrix by solving multiple linear systems in parallel to obtain the posterior moments for SBL. Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm. On several simulations, our method scales better than existing approaches in computation time and memory, especially for structured dictionaries capable of fast matrix-vector multiplication.
    Stacked Residuals of Dynamic Layers for Time Series Anomaly Detection. (arXiv:2202.12457v1 [cs.LG])
    We present an end-to-end differentiable neural network architecture to perform anomaly detection in multivariate time series by incorporating a Sequential Probability Ratio Test on the prediction residual. The architecture is a cascade of dynamical systems designed to separate linearly predictable components of the signal such as trends and seasonality, from the non-linear ones. The former are modeled by local Linear Dynamic Layers, and their residual is fed to a generic Temporal Convolutional Network that also aggregates global statistics from different time series as context for the local predictions of each one. The last layer implements the anomaly detector, which exploits the temporal structure of the prediction residuals to detect both isolated point anomalies and set-point changes. It is based on a novel application of the classic CUMSUM algorithm, adapted through the use of a variational approximation of f-divergences. The model automatically adapts to the time scales of the observed signals. It approximates a SARIMA model at the get-go, and auto-tunes to the statistics of the signal and its covariates, without the need for supervision, as more data is observed. The resulting system, which we call STRIC, outperforms both state-of-the-art robust statistical methods and deep neural network architectures on multiple anomaly detection benchmarks.
    Optimal channel selection with discrete QCQP. (arXiv:2202.12417v1 [cs.CV])
    Reducing the high computational cost of large convolutional neural networks is crucial when deploying the networks to resource-constrained environments. We first show the greedy approach of recent channel pruning methods ignores the inherent quadratic coupling between channels in the neighboring layers and cannot safely remove inactive weights during the pruning procedure. Furthermore, due to these inactive weights, the greedy methods cannot guarantee to satisfy the given resource constraints and deviate with the true objective. In this regard, we propose a novel channel selection method that optimally selects channels via discrete QCQP, which provably prevents any inactive weights and guarantees to meet the resource constraints tightly in terms of FLOPs, memory usage, and network size. We also propose a quadratic model that accurately estimates the actual inference time of the pruned network, which allows us to adopt inference time as a resource constraint option. Furthermore, we generalize our method to extend the selection granularity beyond channels and handle non-sequential connections. Our experiments on CIFAR-10 and ImageNet show our proposed pruning method outperforms other fixed-importance channel pruning methods on various network architectures.
    Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice. (arXiv:2202.12780v1 [stat.ML])
    One of the main tasks of actuaries and data scientists is to build good predictive models for certain phenomena such as the claim size or the number of claims in insurance. These models ideally exploit given feature information to enhance the accuracy of prediction. This user guide revisits and clarifies statistical techniques to assess the calibration or adequacy of a model on the one hand, and to compare and rank different models on the other hand. In doing so, it emphasises the importance of specifying the prediction target at hand a priori and of choosing the scoring function in model comparison in line with this target. Guidance for the practical choice of the scoring function is provided. Striving to bridge the gap between science and daily practice in application, it focuses mainly on the pedagogical presentation of existing results and of best practice. The results are accompanied and illustrated by two real data case studies on workers' compensation and customer churn.
    Benchmarking Generative Latent Variable Models for Speech. (arXiv:2202.12707v1 [eess.AS])
    Stochastic latent variable models (LVMs) achieve state-of-the-art performance on natural image generation but are still inferior to deterministic models on speech. In this paper, we develop a speech benchmark of popular temporal LVMs and compare them against state-of-the-art deterministic models. We report the likelihood, which is a much used metric in the image domain, but rarely, and often incomparably, reported for speech models. To assess the quality of the learned representations, we also compare their usefulness for phoneme recognition. Finally, we adapt the Clockwork VAE, a state-of-the-art temporal LVM for video generation, to the speech domain. Despite being autoregressive only in latent space, we find that the Clockwork VAE can outperform previous LVMs and reduce the gap to deterministic models by using a hierarchy of latent variables.
    Learning Relative Return Policies With Upside-Down Reinforcement Learning. (arXiv:2202.12742v1 [cs.LG])
    Lately, there has been a resurgence of interest in using supervised learning to solve reinforcement learning problems. Recent work in this area has largely focused on learning command-conditioned policies. We investigate the potential of one such method -- upside-down reinforcement learning -- to work with commands that specify a desired relationship between some scalar value and the observed return. We show that upside-down reinforcement learning can learn to carry out such commands online in a tabular bandit setting and in CartPole with non-linear function approximation. By doing so, we demonstrate the power of this family of methods and open the way for their practical use under more complicated command structures.
    Improving generalization with synthetic training data for deep learning based quality inspection. (arXiv:2202.12818v1 [cs.CV])
    Automating quality inspection with computer vision techniques is often a very data-demanding task. Specifically, supervised deep learning requires a large amount of annotated images for training. In practice, collecting and annotating such data is not only costly and laborious, but also inefficient, given the fact that only a few instances may be available for certain defect classes. If working with video frames can increase the number of these instances, it has a major disadvantage: the resulting images will be highly correlated with one another. As a consequence, models trained under such constraints are expected to be very sensitive to input distribution changes, which may be caused in practice by changes in the acquisition system (cameras, lights), in the parts or in the defects aspect. In this work, we demonstrate the use of randomly generated synthetic training images can help tackle domain instability issues, making the trained models more robust to contextual changes. We detail both our synthetic data generation pipeline and our deep learning methodology for answering these questions.
    Structure-aware Unsupervised Tagged-to-Cine MRI Synthesis with Self Disentanglement. (arXiv:2202.12474v1 [eess.IV])
    Cycle reconstruction regularized adversarial training -- e.g., CycleGAN, DiscoGAN, and DualGAN -- has been widely used for image style transfer with unpaired training data. Several recent works, however, have shown that local distortions are frequent, and structural consistency cannot be guaranteed. Targeting this issue, prior works usually relied on additional segmentation or consistent feature extraction steps that are task-specific. To counter this, this work aims to learn a general add-on structural feature extractor, by explicitly enforcing the structural alignment between an input and its synthesized image. Specifically, we propose a novel input-output image patches self-training scheme to achieve a disentanglement of underlying anatomical structures and imaging modalities. The translator and structure encoder are updated, following an alternating training protocol. In addition, the information w.r.t. imaging modality can be eliminated with an asymmetric adversarial game. We train, validate, and test our network on 1,768, 416, and 1,560 unpaired subject-independent slices of tagged and cine magnetic resonance imaging from a total of twenty healthy subjects, respectively, demonstrating superior performance over competing methods.
    Improved Dual Correlation Reduction Network. (arXiv:2202.12533v1 [cs.CV])
    Deep graph clustering, which aims to reveal the underlying graph structure and divide the nodes into different clusters without human annotations, is a fundamental yet challenging task. However, we observed that the existing methods suffer from the representation collapse problem and easily tend to encode samples with different classes into the same latent embedding. Consequently, the discriminative capability of nodes is limited, resulting in sub-optimal clustering performance. To address this problem, we propose a novel deep graph clustering algorithm termed Improved Dual Correlation Reduction Network (IDCRN) through improving the discriminative capability of samples. Specifically, by approximating the cross-view feature correlation matrix to an identity matrix, we reduce the redundancy between different dimensions of features, thus improving the discriminative capability of the latent space explicitly. Meanwhile, the cross-view sample correlation matrix is forced to approximate the designed clustering-refined adjacency matrix to guide the learned latent representation to recover the affinity matrix even across views, thus enhancing the discriminative capability of features implicitly. Moreover, we avoid the collapsed representation caused by the over-smoothing issue in Graph Convolutional Networks (GCNs) through an introduced propagation regularization term, enabling IDCRN to capture the long-range information with the shallow network structure. Extensive experimental results on six benchmarks have demonstrated the effectiveness and the efficiency of IDCRN compared to the existing state-of-the-art deep graph clustering algorithms.
    Statistics and Deep Learning-based Hybrid Model for Interpretable Anomaly Detection. (arXiv:2202.12720v1 [cs.LG])
    Hybrid methods have been shown to outperform pure statistical and pure deep learning methods at both forecasting tasks, and at quantifying the uncertainty associated with those forecasts (prediction intervals). One example is Multivariate Exponential Smoothing Long Short-Term Memory (MES-LSTM), a hybrid between a multivariate statistical forecasting model and a Recurrent Neural Network variant, Long Short-Term Memory. It has also been shown that a model that ($i$) produces accurate forecasts and ($ii$) is able to quantify the associated predictive uncertainty satisfactorily, can be successfully adapted to a model suitable for anomaly detection tasks. With the increasing ubiquity of multivariate data and new application domains, there have been numerous anomaly detection methods proposed in recent years. The proposed methods have largely focused on deep learning techniques, which are prone to suffer from challenges such as ($i$) large sets of parameters that may be computationally intensive to tune, $(ii)$ returning too many false positives rendering the techniques impractical for use, $(iii)$ requiring labeled datasets for training which are often not prevalent in real life, and ($iv$) understanding of the root causes of anomaly occurrences inhibited by the predominantly black-box nature of deep learning methods. In this article, an extension of MES-LSTM is presented, an interpretable anomaly detection model that overcomes these challenges. With a focus on renewable energy generation as an application domain, the proposed approach is benchmarked against the state-of-the-art. The findings are that MES-LSTM anomaly detector is at least competitive to the benchmarks at anomaly detection tasks, and less prone to learning from spurious effects than the benchmarks, thus making it more reliable at root cause discovery and explanation.
    MUC-driven Feature Importance Measurement and Adversarial Analysis for Random Forest. (arXiv:2202.12512v1 [cs.LG])
    The broad adoption of Machine Learning (ML) in security-critical fields demands the explainability of the approach. However, the research on understanding ML models, such as Random Forest (RF), is still in its infant stage. In this work, we leverage formal methods and logical reasoning to develop a novel model-specific method for explaining the prediction of RF. Our approach is centered around Minimal Unsatisfiable Cores (MUC) and provides a comprehensive solution for feature importance, covering local and global aspects, and adversarial sample analysis. Experimental results on several datasets illustrate the high quality of our feature importance measurement. We also demonstrate that our adversarial analysis outperforms the state-of-the-art method. Moreover, our method can produce a user-centered report, which helps provide recommendations in real-life applications.
    NeuralKG: An Open Source Library for Diverse Representation Learning of Knowledge Graphs. (arXiv:2202.12571v1 [cs.LG])
    NeuralKG is an open-source Python-based library for diverse representation learning of knowledge graphs. It implements three different series of Knowledge Graph Embedding (KGE) methods, including conventional KGEs, GNN-based KGEs, and Rule-based KGEs. With a unified framework, NeuralKG successfully reproduces link prediction results of these methods on benchmarks, freeing users from the laborious task of reimplementing them, especially for some methods originally written in non-python programming languages. Besides, NeuralKG is highly configurable and extensible. It provides various decoupled modules that can be mixed and adapted to each other. Thus with NeuralKG, developers and researchers can quickly implement their own designed models and obtain the optimal training methods to achieve the best performance efficiently. We built an website in this http URL to organize an open and shared KG representation learning community. The source code is all publicly released at https://github.com/zjukg/NeuralKG.
    Learning POD of Complex Dynamics Using Heavy-ball Neural ODEs. (arXiv:2202.12373v1 [cs.LG])
    Proper orthogonal decomposition (POD) allows reduced-order modeling of complex dynamical systems at a substantial level, while maintaining a high degree of accuracy in modeling the underlying dynamical systems. Advances in machine learning algorithms enable learning POD-based dynamics from data and making accurate and fast predictions of dynamical systems. In this paper, we leverage the recently proposed heavy-ball neural ODEs (HBNODEs) [Xia et al. NeurIPS, 2021] for learning data-driven reduced-order models (ROMs) in the POD context, in particular, for learning dynamics of time-varying coefficients generated by the POD analysis on training snapshots generated from solving full order models. HBNODE enjoys several practical advantages for learning POD-based ROMs with theoretical guarantees, including 1) HBNODE can learn long-term dependencies effectively from sequential observations and 2) HBNODE is computationally efficient in both training and testing. We compare HBNODE with other popular ROMs on several complex dynamical systems, including the von K\'{a}rm\'{a}n Street flow, the Kurganov-Petrova-Popov equation, and the one-dimensional Euler equations for fluids modeling.
    Multi-Head ReLU Implicit Neural Representation Networks. (arXiv:2110.03448v2 [cs.LG] UPDATED)
    In this paper, a novel multi-head multi-layer perceptron (MLP) structure is presented for implicit neural representation (INR). Since conventional rectified linear unit (ReLU) networks are shown to exhibit spectral bias towards learning low-frequency features of the signal, we aim at mitigating this defect by taking advantage of the local structure of the signals. To be more specific, an MLP is used to capture the global features of the underlying generator function of the desired signal. Then, several heads are utilized to reconstruct disjoint local features of the signal, and to reduce the computational complexity, sparse layers are deployed for attaching heads to the body. Through various experiments, we show that the proposed model does not suffer from the special bias of conventional ReLU networks and has superior generalization capabilities. Finally, simulation results confirm that the proposed multi-head structure outperforms existing INR methods with considerably less computational cost.
    Deep Learning, Natural Language Processing, and Explainable Artificial Intelligence in the Biomedical Domain. (arXiv:2202.12678v1 [cs.AI])
    In this article, we first give an introduction to artificial intelligence and its applications in biology and medicine in Section 1. Deep learning methods are then described in Section 2. We narrow down the focus of the study on textual data in Section 3, where natural language processing and its applications in the biomedical domain are described. In Section 4, we give an introduction to explainable artificial intelligence and discuss the importance of explainability of artificial intelligence systems, especially in the biomedical domain.
    Bridging the Gap Between Patient-specific and Patient-independent Seizure Prediction via Knowledge Distillation. (arXiv:2202.12598v1 [cs.LG])
    Objective. Deep neural networks (DNN) have shown unprecedented success in various brain-machine interface (BCI) applications such as epileptic seizure prediction. However, existing approaches typically train models in a patient-specific fashion due to the highly personalized characteristics of epileptic signals. Therefore, only a limited number of labeled recordings from each subject can be used for training. As a consequence, current DNN based methods demonstrate poor generalization ability to some extent due to the insufficiency of training data. On the other hand, patient-independent models attempt to utilize more patient data to train a universal model for all patients by pooling patient data together. Despite different techniques applied, results show that patient-independent models perform worse than patient-specific models due to high individual variation across patients. A substantial gap thus exists between patient-specific and patient-independent models. In this paper, we propose a novel training scheme based on knowledge distillation which makes use of a large amount of data from multiple subjects. It first distills informative features from signals of all available subjects with a pre-trained general model. A patient-specific model can then be obtained with the help of distilled knowledge and additional personalized data. Significance. The proposed training scheme significantly improves the performance of patient-specific seizure predictors and bridges the gap between patient-specific and patient-independent predictors. Five state-of-the-art seizure prediction methods are trained on the CHB-MIT sEEG database with our proposed scheme. The resulting accuracy, sensitivity, and false prediction rate show that our proposed training scheme consistently improves the prediction performance of state-of-the-art methods by a large margin.
    Towards Better Meta-Initialization with Task Augmentation for Kindergarten-aged Speech Recognition. (arXiv:2202.12326v1 [eess.AS])
    Children's automatic speech recognition (ASR) is always difficult due to, in part, the data scarcity problem, especially for kindergarten-aged kids. When data are scarce, the model might overfit to the training data, and hence good starting points for training are essential. Recently, meta-learning was proposed to learn model initialization (MI) for ASR tasks of different languages. This method leads to good performance when the model is adapted to an unseen language. However, MI is vulnerable to overfitting on training tasks (learner overfitting). It is also unknown whether MI generalizes to other low-resource tasks. In this paper, we validate the effectiveness of MI in children's ASR and attempt to alleviate the problem of learner overfitting. To achieve model-agnostic meta-learning (MAML), we regard children's speech at each age as a different task. In terms of learner overfitting, we propose a task-level augmentation method by simulating new ages using frequency warping techniques. Detailed experiments are conducted to show the impact of task augmentation on each age for kindergarten-aged speech. As a result, our approach achieves a relative word error rate (WER) improvement of 51% over the baseline system with no augmentation or initialization.
    Prediction of Depression Severity Based on the Prosodic and Semantic Features with Bidirectional LSTM and Time Distributed CNN. (arXiv:2202.12456v1 [cs.HC])
    Depression is increasingly impacting individuals both physically and psychologically worldwide. It has become a global major public health problem and attracts attention from various research fields. Traditionally, the diagnosis of depression is formulated through semi-structured interviews and supplementary questionnaires, which makes the diagnosis heavily relying on physicians experience and is subject to bias. Mental health monitoring and cloud-based remote diagnosis can be implemented through an automated depression diagnosis system. In this article, we propose an attention-based multimodality speech and text representation for depression prediction. Our model is trained to estimate the depression severity of participants using the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) dataset. For the audio modality, we use the collaborative voice analysis repository (COVAREP) features provided by the dataset and employ a Bidirectional Long Short-Term Memory Network (Bi-LSTM) followed by a Time-distributed Convolutional Neural Network (T-CNN). For the text modality, we use global vectors for word representation (GloVe) to perform word embeddings and the embeddings are fed into the Bi-LSTM network. Results show that both audio and text models perform well on the depression severity estimation task, with best sequence level F1 score of 0.9870 and patient-level F1 score of 0.9074 for the audio model over five classes (healthy, mild, moderate, moderately severe, and severe), as well as sequence level F1 score of 0.9709 and patient-level F1 score of 0.9245 for the text model over five classes. Results are similar for the multimodality fused model, with the highest F1 score of 0.9580 on the patient-level depression detection task over five classes. Experiments show statistically significant improvements over previous works.
    Thompson Sampling with Unrestricted Delays. (arXiv:2202.12431v1 [cs.LG])
    We investigate properties of Thompson Sampling in the stochastic multi-armed bandit problem with delayed feedback. In a setting with i.i.d delays, we establish to our knowledge the first regret bounds for Thompson Sampling with arbitrary delay distributions, including ones with unbounded expectation. Our bounds are qualitatively comparable to the best available bounds derived via ad-hoc algorithms, and only depend on delays via selected quantiles of the delay distributions. Furthermore, in extensive simulation experiments, we find that Thompson Sampling outperforms a number of alternative proposals, including methods specifically designed for settings with delayed feedback.
    On the Effectiveness of Dataset Watermarking in Adversarial Settings. (arXiv:2202.12506v1 [cs.CR])
    In a data-driven world, datasets constitute a significant economic value. Dataset owners who spend time and money to collect and curate the data are incentivized to ensure that their datasets are not used in ways that they did not authorize. When such misuse occurs, dataset owners need technical mechanisms for demonstrating their ownership of the dataset in question. Dataset watermarking provides one approach for ownership demonstration which can, in turn, deter unauthorized use. In this paper, we investigate a recently proposed data provenance method, radioactive data, to assess if it can be used to demonstrate ownership of (image) datasets used to train machine learning (ML) models. The original paper reported that radioactive data is effective in white-box settings. We show that while this is true for large datasets with many classes, it is not as effective for datasets where the number of classes is low $(\leq 30)$ or the number of samples per class is low $(\leq 500)$. We also show that, counter-intuitively, the black-box verification technique is effective for all datasets used in this paper, even when white-box verification is not. Given this observation, we show that the confidence in white-box verification can be improved by using watermarked samples directly during the verification process. We also highlight the need to assess the robustness of radioactive data if it were to be used for ownership demonstration since it is an adversarial setting unlike provenance identification. Compared to dataset watermarking, ML model watermarking has been explored more extensively in recent literature. However, most of the model watermarking techniques can be defeated via model extraction. We show that radioactive data can effectively survive model extraction attacks, which raises the possibility that it can be used for ML model ownership verification robust against model extraction.
    Private and polynomial time algorithms for learning Gaussians and beyond. (arXiv:2111.11320v2 [stat.ML] UPDATED)
    We present a fairly general framework for reducing $(\varepsilon, \delta)$ differentially private (DP) statistical estimation to its non-private counterpart. As the main application of this framework, we give a polynomial time and $(\varepsilon,\delta)$-DP algorithm for learning (unrestricted) Gaussian distributions in $\mathbb{R}^d$. The sample complexity of our approach for learning the Gaussian up to total variation distance $\alpha$ is $\widetilde{O}(d^2/\alpha^2 + d^2\sqrt{\ln(1/\delta)}/\alpha \varepsilon + d\ln(1/\delta) / \alpha \varepsilon)$ matching (up to logarithmic factors) the best known information-theoretic (non-efficient) sample complexity upper bound due to Aden-Ali, Ashtiani, and Kamath ~(ALT'21). In an independent work, Kamath, Mouzakis, Singhal, Steinke, and Ullman~(arXiv:2111.04609) proved a similar result using a different approach and with $O(d^{5/2})$ sample complexity dependence on $d$. As another application of our framework, we provide the first polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of (unrestricted) Gaussians with sample complexity $\widetilde{O}(d^{3.5})$. In another independent work, Kothari, Manurangsi, and Velingker~(arXiv:2112.03548) also provided a polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of Gaussians with sample complexity $\widetilde{O}(d^8)$.
    How Much of the Chemical Space Has Been Explored? Selecting the Right Exploration Measure for Drug Discovery. (arXiv:2112.12542v2 [cs.CE] UPDATED)
    Forming a molecular candidate set that contains a wide range of potentially effective compounds is crucial to the success of drug discovery. While many aim to optimize particular chemical properties, there is limited literature on how to properly measure and encourage the exploration of the chemical space when generating drug candidates. This problem is challenging due to the lack of formal criteria to select good exploration measures. We propose a novel framework to systematically evaluate exploration measures for drug candidate generation. The procedure is built upon three formal analyses: an axiomatic analysis that validates the potential measures analytically, an empirical analysis that compares the correlations of the measures to a proxy gold standard, and a practical analysis that benchmarks the effectiveness of the measures in an optimization procedure of molecular generation. We are able to evaluate a wide range of potential exploration measures under this framework and make recommendations on existing and novel exploration measures that are suitable for the task of drug discovery.
    Fourier-Based Augmentations for Improved Robustness and Uncertainty Calibration. (arXiv:2202.12412v1 [cs.CV])
    Diverse data augmentation strategies are a natural approach to improving robustness in computer vision models against unforeseen shifts in data distribution. However, the ability to tailor such strategies to inoculate a model against specific classes of corruptions or attacks -- without incurring substantial losses in robustness against other classes of corruptions -- remains elusive. In this work, we successfully harden a model against Fourier-based attacks, while producing superior-to-AugMix accuracy and calibration results on both the CIFAR-10-C and CIFAR-100-C datasets; classification error is reduced by over ten percentage points for some high-severity noise and digital-type corruptions. We achieve this by incorporating Fourier-basis perturbations in the AugMix image-augmentation framework. Thus we demonstrate that the AugMix framework can be tailored to effectively target particular distribution shifts, while boosting overall model robustness.
    A Deep Learning Approach for Network-wide Dynamic Traffic Prediction during Hurricane Evacuation. (arXiv:2202.12505v1 [cs.LG])
    Proactive evacuation traffic management largely depends on real-time monitoring and prediction of traffic flow at a high spatiotemporal resolution. However, evacuation traffic prediction is challenging due to the uncertainties caused by sudden changes in projected hurricane paths and consequently household evacuation behavior. Moreover, modeling spatiotemporal traffic flow patterns requires extensive data over a longer time period, whereas evacuations typically last for 2 to 5 days. In this paper, we present a novel data-driven approach for predicting evacuation traffic at a network scale. We develop a dynamic graph convolution LSTM (DGCN-LSTM) model to learn the network dynamics of hurricane evacuation. We first train the model for non-evacuation period traffic data showing that the model outperforms existing deep learning models for predicting non-evacuation period traffic with an RMSE value of 226.84. However, when we apply the model for evacuation period, the RMSE value increased to 1440.99. We overcome this issue by adopting a transfer learning approach with additional features related to evacuation traffic demand such as distance from the evacuation zone, time to landfall, and other zonal level features to control the transfer of information (network dynamics) from non-evacuation periods to evacuation periods. The final transfer learned DGCN-LSTM model performs well to predict evacuation traffic flow (RMSE=399.69). The implemented model can be applied to predict evacuation traffic over a longer forecasting horizon (6 hour). It will assist transportation agencies to activate appropriate traffic management strategies to reduce delays for evacuating traffic.
    Exploiting auto-encoders for middle-level explanations of image classification systems. (arXiv:2106.05037v3 [cs.LG] UPDATED)
    A central issue addressed by the rapidly growing research area of eXplainable Artificial Intelligence (XAI) is to provide methods to give explanations for the behaviours of Machine Learning (ML) non-interpretable models after the training. Recently, it is becoming more and more evident that new directions to create better explanations should take into account what a good explanation is to a human user. This paper suggests taking advantage of developing an XAI framework that allows producing multiple explanations for the response of image a classification system in terms of potentially different middle-level input features. To this end, we propose an XAI framework able to construct explanations in terms of input features extracted by auto-encoders. We start from the hypothesis that some autoencoders, relying on standard data representation approaches, could extract more salient and understandable input properties, which we call here \textit{Middle-Level input Features} (MLFs), for a user with respect to raw low-level features. Furthermore, extracting different types of MLFs through different type of autoencoders, different types of explanations for the same ML system behaviour can be returned. We experimentally tested our method on two different image datasets and using three different types of MLFs. The results are encouraging. Although our novel approach was tested in the context of image classification, it can potentially be used on other data types to the extent that auto-encoders to extract humanly understandable representations can be applied.
    Up to 100$\times$ Faster Data-free Knowledge Distillation. (arXiv:2112.06253v2 [cs.LG] UPDATED)
    Data-free knowledge distillation (DFKD) has recently been attracting increasing attention from research communities, attributed to its capability to compress a model only using synthetic data. Despite the encouraging results achieved, state-of-the-art DFKD methods still suffer from the inefficiency of data synthesis, making the data-free training process extremely time-consuming and thus inapplicable for large-scale tasks. In this work, we introduce an efficacious scheme, termed as FastDFKD, that allows us to accelerate DFKD by a factor of orders of magnitude. At the heart of our approach is a novel strategy to reuse the shared common features in training data so as to synthesize different data instances. Unlike prior methods that optimize a set of data independently, we propose to learn a meta-synthesizer that seeks common features as the initialization for the fast data synthesis. As a result, FastDFKD achieves data synthesis within only a few steps, significantly enhancing the efficiency of data-free training. Experiments over CIFAR, NYUv2, and ImageNet demonstrate that the proposed FastDFKD achieves 10$\times$ and even 100$\times$ acceleration while preserving performances on par with state of the art. Code is available at \url{https://github.com/zju-vipa/Fast-Datafree}.
    Spatio-Temporal Latent Graph Structure Learning for Traffic Forecasting. (arXiv:2202.12586v1 [cs.LG])
    Accurate traffic forecasting, the foundation of intelligent transportation systems (ITS), has never been more significant than nowadays due to the prosperity of the smart cities and urban computing. Recently, Graph Neural Network truly outperforms the traditional methods. Nevertheless, the most conventional GNN based model works well while given a pre-defined graph structure. And the existing methods of defining the graph structures focus purely on spatial dependencies and ignored the temporal correlation. Besides, the semantics of the static pre-defined graph adjacency applied during the whole training progress is always incomplete, thus overlooking the latent topologies that may fine-tune the model. To tackle these challenges, we proposed a new traffic forecasting framework--Spatio-Temporal Latent Graph Structure Learning networks (ST-LGSL). More specifically, the model employed a graph generator based on Multilayer perceptron and K-Nearest Neighbor, which learns the latent graph topological information from the entire data considering both spatial and temporal dynamics. Furthermore, with the initialization of MLP-kNN based on ground-truth adjacency matrix and similarity metric in kNN, ST-LGSL aggregates the topologies focusing on geography and node similarity. Additionally, the generated graphs act as the input of spatio-temporal prediction module combined with the Diffusion Graph Convolutions and Gated Temporal Convolutions Networks. Experimental results on two benchmarking datasets in real world demonstrate that ST-LGSL outperforms various types of state-of-art baselines.
    Gen\'eLive! Generating Rhythm Actions in Love Live!. (arXiv:2202.12823v1 [cs.LG])
    A rhythm action game is a music-based video game in which the player is challenged to issue commands at the right timings during a music session. The timings are rendered in the chart, which consists of visual symbols, called notes, flying through the screen. KLab Inc., a Japan-based video game developer, has operated rhythm action games including a title for the "Love Live!" franchise, which became a hit across Asia and beyond. Before this work, the company generated the charts manually, which resulted in a costly business operation. This paper presents how KLab applied a deep generative model for synthesizing charts, and shows how it has improved the chart production process, reducing the business cost by half. Existing generative models generated poor quality charts for easier difficulty modes. We report how we overcame this challenge through a multi-scaling model dedicated to rhythm actions, by considering beats among other things. Our model, named Gen\'eLive!, is evaluated using production datasets at KLab as well as open datasets.
    Matrix factorisation and the interpretation of geodesic distance. (arXiv:2106.01260v2 [stat.ML] UPDATED)
    Given a graph or similarity matrix, we consider the problem of recovering a notion of true distance between the nodes, and so their true positions. We show that this can be accomplished in two steps: matrix factorisation, followed by nonlinear dimension reduction. This combination is effective because the point cloud obtained in the first step lives close to a manifold in which latent distance is encoded as geodesic distance. Hence, a nonlinear dimension reduction tool, approximating geodesic distance, can recover the latent positions, up to a simple transformation. We give a detailed account of the case where spectral embedding is used, followed by Isomap, and provide encouraging experimental evidence for other combinations of techniques.
    Evolutionary scheduling of university activities based on consumption forecasts to minimise electricity costs. (arXiv:2202.12595v1 [cs.LG])
    This paper presents a solution to a predict then optimise problem which goal is to reduce the electricity cost of a university campus. The proposed methodology combines a multi-dimensional time series forecast and a novel approach to large-scale optimization. Gradient-boosting method is applied to forecast both generation and consumption time-series of the Monash university campus for the month of November 2020. For the consumption forecasts we employ log transformation to model trend and stabilize variance. Additional seasonality and trend features are added to the model inputs when applicable. The forecasts obtained are used as the base load for the schedule optimisation of university activities and battery usage. The goal of the optimisation is to minimize the electricity cost consisting of the price of electricity and the peak electricity tariff both altered by the load from class activities and battery use as well as the penalty of not scheduling some optional activities. The schedule of the class activities is obtained through evolutionary optimisation using the covariance matrix adaptation evolution strategy and the genetic algorithm. This schedule is then improved through local search by testing possible times for each activity one-by-one. The battery schedule is formulated as a mixed-integer programming problem and solved by the Gurobi solver. This method obtains the second lowest cost when evaluated against 6 other methods presented at an IEEE competition that all used mixed-integer programming and the Gurobi solver to schedule both the activities and the battery use.
    Convergence of a New Learning Algorithm. (arXiv:2202.12829v1 [cs.NE])
    A new learning algorithm proposed by Brandt and Lin for neural network [1], [2] has been shown to be mathematically equivalent to the conventional back-propagation learning algorithm, but has several advantages over the backpropagation algorithm, including feedback-network-free implementation and biological plausibility. In this paper, we investigate the convergence of the new algorithm. A necessary and sufficient condition for the algorithm to converge is derived. A convergence measure is proposed to measure the convergence rate of the new algorithm. Simulation studies are conducted to investigate the convergence of the algorithm with respect to the number of neurons, the connection distance, the connection density, the ratio of excitatory/inhibitory synapses, the membrane potentials, and the synapse strengths.
    Efficient Neural Causal Discovery without Acyclicity Constraints. (arXiv:2107.10483v3 [cs.LG] UPDATED)
    Learning the structure of a causal graphical model using both observational and interventional data is a fundamental problem in many scientific fields. A promising direction is continuous optimization for score-based methods, which, however, require constrained optimization to enforce acyclicity or lack convergence guarantees. In this paper, we present ENCO, an efficient structure learning method for directed, acyclic causal graphs leveraging observational and interventional data. ENCO formulates the graph search as an optimization of independent edge likelihoods, with the edge orientation being modeled as a separate parameter. Consequently, we can provide convergence guarantees of ENCO under mild conditions without constraining the score function with respect to acyclicity. In experiments, we show that ENCO can efficiently recover graphs with hundreds of nodes, an order of magnitude larger than what was previously possible, while handling deterministic variables and latent confounders.
    Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform. (arXiv:2110.05802v2 [cs.LG] UPDATED)
    Obtaining standardized crowdsourced benchmark of computational methods is a major issue in data science communities. Dedicated frameworks enabling fair benchmarking in a unified environment are yet to be developed. Here we introduce Codabench, an open-source, community-driven platform for benchmarking algorithms or software agents versus datasets or tasks. A public instance of Codabench (https://www.codabench.org/) is open to everyone, free of charge, and allows benchmark organizers to compare fairly submissions, under the same setting (software, hardware, data, algorithms), with custom protocols and data formats. Codabench has unique features facilitating the organization of benchmarks flexibly, easily and reproducibly, such as the possibility of re-using templates of benchmarks, and supplying compute resources on-demand. Codabench has been used internally and externally on various applications, receiving more than 130 users and 2500 submissions. As illustrative use cases, we introduce 4 diverse benchmarks covering Graph Machine Learning, Cancer Heterogeneity, Clinical Diagnosis and Reinforcement Learning.
    PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine. (arXiv:2202.12674v1 [cs.LG])
    Machine learning algorithms must be able to efficiently cope with massive data sets. Therefore, they have to scale well on any modern system and be able to exploit the computing power of accelerators independent of their vendor. In the field of supervised learning, Support Vector Machines (SVMs) are widely used. However, even modern and optimized implementations such as LIBSVM or ThunderSVM do not scale well for large non-trivial dense data sets on cutting-edge hardware: Most SVM implementations are based on Sequential Minimal Optimization, an optimized though inherent sequential algorithm. Hence, they are not well-suited for highly parallel GPUs. Furthermore, we are not aware of a performance portable implementation that supports CPUs and GPUs from different vendors. We have developed the PLSSVM library to solve both issues. First, we resort to the formulation of the SVM as a least squares problem. Training an SVM then boils down to solving a system of linear equations for which highly parallel algorithms are known. Second, we provide a hardware independent yet efficient implementation: PLSSVM uses different interchangeable backends--OpenMP, CUDA, OpenCL, SYCL--supporting modern hardware from various vendors like NVIDIA, AMD, or Intel on multiple GPUs. PLSSVM can be used as a drop-in replacement for LIBSVM. We observe a speedup on CPUs of up to 10 compared to LIBSVM and on GPUs of up to 14 compared to ThunderSVM. Our implementation scales on many-core CPUs with a parallel speedup of 74.7 on up to 256 CPU threads and on multiple GPUs with a parallel speedup of 3.71 on four GPUs. The code, utility scripts, and the documentation are available on GitHub: https://github.com/SC-SGS/PLSSVM.
    Learning to Schedule Heuristics for the Simultaneous Stochastic Optimization of Mining Complexes. (arXiv:2202.12866v1 [math.OC])
    The simultaneous stochastic optimization of mining complexes (SSOMC) is a large-scale stochastic combinatorial optimization problem that simultaneously manages the extraction of materials from multiple mines and their processing using interconnected facilities to generate a set of final products, while taking into account material supply (geological) uncertainty to manage the associated risk. Although simulated annealing has been shown to outperform comparing methods for solving the SSOMC, early performance might dominate recent performance in that a combination of the heuristics' performance is used to determine which perturbations to apply. This work proposes a data-driven framework for heuristic scheduling in a fully self-managed hyper-heuristic to solve the SSOMC. The proposed learn-to-perturb (L2P) hyper-heuristic is a multi-neighborhood simulated annealing algorithm. The L2P selects the heuristic (perturbation) to be applied in a self-adaptive manner using reinforcement learning to efficiently explore which local search is best suited for a particular search point. Several state-of-the-art agents have been incorporated into L2P to better adapt the search and guide it towards better solutions. By learning from data describing the performance of the heuristics, a problem-specific ordering of heuristics that collectively finds better solutions faster is obtained. L2P is tested on several real-world mining complexes, with an emphasis on efficiency, robustness, and generalization capacity. Results show a reduction in the number of iterations by 30-50% and in the computational time by 30-45%.
    Online handwriting, signature and touch dynamics: tasks and potential applications in the field of security and health. (arXiv:2202.12693v1 [cs.CR])
    Background: An advantageous property of behavioural signals ,e.g. handwriting, in contrast to morphological ones, such as iris, fingerprint, hand geometry, etc., is the possibility to ask a user for a very rich amount of different tasks. Methods: This article summarises recent findings and applications of different handwriting and drawing tasks in the field of security and health. More specifically, it is focused on on-line handwriting and hand-based interaction, i.e. signals that utilise a digitizing device (specific devoted or general-purpose tablet/smartphone) during the realization of the tasks. Such devices permit the acquisition of on-surface dynamics as well as in-air movements in time, thus providing complex and richer information when compared to the conventional pen and paper method. Conclusions: Although the scientific literature reports a wide range of tasks and applications, in this paper, we summarize only those providing competitive results (e.g. in terms of discrimination power) and having a significant impact in the field.
    Virtual Augmentation Supported Contrastive Learning of Sentence Representations. (arXiv:2110.08552v2 [cs.CL] UPDATED)
    Despite profound successes, contrastive representation learning relies on carefully designed data augmentations using domain specific knowledge. This challenge is magnified in natural language processing where no general rules exist for data augmentation due to the discrete nature of natural language. We tackle this challenge by presenting a Virtual augmentation Supported Contrastive Learning of sentence representations (VaSCL). Originating from the interpretation that data augmentation essentially constructs the neighborhoods of each training instance, we in turn utilize the neighborhood to generate effective data augmentations. Leveraging the large training batch size of contrastive learning, we approximate the neighborhood of an instance via its K-nearest in-batch neighbors in the representation space. We then define an instance discrimination task regarding this neighborhood and generate the virtual augmentation in an adversarial training manner. We access the performance of VaSCL on a wide range of downstream tasks, and set a new state-of-the-art for unsupervised sentence representation learning.
    DataLab: A Platform for Data Analysis and Intervention. (arXiv:2202.12875v1 [cs.LG])
    Despite data's crucial role in machine learning, most existing tools and research tend to focus on systems on top of existing data rather than how to interpret and manipulate data. In this paper, we propose DataLab, a unified data-oriented platform that not only allows users to interactively analyze the characteristics of data, but also provides a standardized interface for different data processing operations. Additionally, in view of the ongoing proliferation of datasets, \toolname has features for dataset recommendation and global vision analysis that help researchers form a better view of the data ecosystem. So far, DataLab covers 1,715 datasets and 3,583 of its transformed version (e.g., hyponyms replacement), where 728 datasets support various analyses (e.g., with respect to gender bias) with the help of 140M samples annotated by 318 feature functions. DataLab is under active development and will be supported going forward. We have released a web platform, web API, Python SDK, PyPI published package and online documentation, which hopefully, can meet the diverse needs of researchers.
    No Regrets for Learning the Prior in Bandits. (arXiv:2107.06196v2 [cs.LG] UPDATED)
    We propose ${\tt AdaTS}$, a Thompson sampling algorithm that adapts sequentially to bandit tasks that it interacts with. The key idea in ${\tt AdaTS}$ is to adapt to an unknown task prior distribution by maintaining a distribution over its parameters. When solving a bandit task, that uncertainty is marginalized out and properly accounted for. ${\tt AdaTS}$ is a fully-Bayesian algorithm that can be implemented efficiently in several classes of bandit problems. We derive upper bounds on its Bayes regret that quantify the loss due to not knowing the task prior, and show that it is small. Our theory is supported by experiments, where ${\tt AdaTS}$ outperforms prior algorithms and works well even in challenging real-world problems.
    Learning Invariant Weights in Neural Networks. (arXiv:2202.12439v1 [stat.ML])
    Assumptions about invariances or symmetries in data can significantly increase the predictive power of statistical models. Many commonly used models in machine learning are constraint to respect certain symmetries in the data, such as translation equivariance in convolutional neural networks, and incorporation of new symmetry types is actively being studied. Yet, efforts to learn such invariances from the data itself remains an open research problem. It has been shown that marginal likelihood offers a principled way to learn invariances in Gaussian Processes. We propose a weight-space equivalent to this approach, by minimizing a lower bound on the marginal likelihood to learn invariances in neural networks resulting in naturally higher performing models.
    Median Optimal Treatment Regimes. (arXiv:2103.01802v2 [stat.ME] UPDATED)
    Optimal treatment regimes are personalized policies for making a treatment decision based on subject characteristics, with the policy chosen to maximize some value. It is common to aim to maximize the mean outcome in the population, via a regime assigning treatment only to those whose mean outcome is higher under treatment versus control. However, the mean can be an unstable measure of centrality, resulting in imprecise statistical procedures, as well as unrobust decisions that can be overly influenced by a small fraction of subjects. In this work, we propose a new median optimal treatment regime that instead treats individuals whose conditional median is higher under treatment. This ensures that optimal decisions for individuals from the same group are not overly influenced either by (i) a small fraction of the group (unlike the mean criterion), or (ii) unrelated subjects from different groups (unlike marginal median/quantile criteria). We introduce a new measure of value, the Average Conditional Median Effect (ACME), which summarizes across-group median treatment outcomes of a policy, and which the median optimal treatment regime maximizes. After developing key motivating examples that distinguish median optimal treatment regimes from mean and marginal median optimal treatment regimes, we give a nonparametric efficiency bound for estimating the ACME of a policy, and propose a new doubly robust-style estimator that achieves the efficiency bound under weak conditions. To construct the median optimal treatment regime, we introduce a new doubly robust-style estimator for the conditional median treatment effect. Finite-sample properties are explored via numerical simulations and the proposed algorithm is illustrated using data from a randomized clinical trial in patients with HIV.
    Understanding Adversarial Robustness from Feature Maps of Convolutional Layers. (arXiv:2202.12435v1 [cs.CV])
    The adversarial robustness of a neural network mainly relies on two factors, one is the feature representation capacity of the network, and the other is its resistance ability to perturbations. In this paper, we study the anti-perturbation ability of the network from the feature maps of convolutional layers. Our theoretical analysis discovers that larger convolutional features before average pooling can contribute to better resistance to perturbations, but the conclusion is not true for max pooling. Based on the theoretical findings, we present two feasible ways to improve the robustness of existing neural networks. The proposed approaches are very simple and only require upsampling the inputs or modifying the stride configuration of convolution operators. We test our approaches on several benchmark neural network architectures, including AlexNet, VGG16, RestNet18 and PreActResNet18, and achieve non-trivial improvements on both natural accuracy and robustness under various attacks. Our study brings new insights into the design of robust neural networks. The code is available at \url{https://github.com/MTandHJ/rcm}.
    On Learning and Testing of Counterfactual Fairness through Data Preprocessing. (arXiv:2202.12440v1 [stat.ML])
    Machine learning has become more important in real-life decision-making but people are concerned about the ethical problems it may bring when used improperly. Recent work brings the discussion of machine learning fairness into the causal framework and elaborates on the concept of Counterfactual Fairness. In this paper, we develop the Fair Learning through dAta Preprocessing (FLAP) algorithm to learn counterfactually fair decisions from biased training data and formalize the conditions where different data preprocessing procedures should be used to guarantee counterfactual fairness. We also show that Counterfactual Fairness is equivalent to the conditional independence of the decisions and the sensitive attributes given the processed non-sensitive attributes, which enables us to detect discrimination in the original decision using the processed data. The performance of our algorithm is illustrated using simulated data and real-world applications.
    6D Rotation Representation For Unconstrained Head Pose Estimation. (arXiv:2202.12555v1 [cs.CV])
    In this paper, we present a method for unconstrained end-to-end head pose estimation. We address the problem of ambiguous rotation labels by introducing the rotation matrix formalism for our ground truth data and propose a continuous 6D rotation matrix representation for efficient and robust direct regression. This way, our method can learn the full rotation appearance which is contrary to previous approaches that restrict the pose prediction to a narrow-angle for satisfactory results. In addition, we propose a geodesic distance-based loss to penalize our network with respect to the SO(3) manifold geometry. Experiments on the public AFLW2000 and BIWI datasets demonstrate that our proposed method significantly outperforms other state-of-the-art methods by up to 20\%. We open-source our training and testing code along with our pre-trained models: https://github.com/thohemp/6DRepNet.
    Ensemble Method for Estimating Individualized Treatment Effects. (arXiv:2202.12445v1 [stat.ME])
    In many medical and business applications, researchers are interested in estimating individualized treatment effects using data from a randomized experiment. For example in medical applications, doctors learn the treatment effects from clinical trials and in technology companies, researchers learn them from A/B testing experiments. Although dozens of machine learning models have been proposed for this task, it is challenging to determine which model will be best for the problem at hand because ground-truth treatment effects are unobservable. In contrast to several recent papers proposing methods to \textit{select} one of these competing models, we propose an algorithm for \textit{aggregating} the estimates from a diverse library of models. We compare ensembling to model selection on 43 benchmark datasets, and find that ensembling wins almost every time. Theoretically, we prove that our ensemble model is (asymptotically) at least as accurate as the best model under consideration, even if the number of candidate models is allowed to grow with the sample size.
    A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits. (arXiv:2108.11345v3 [cs.LG] UPDATED)
    This paper unifies the design and the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem for a class of risk functionals $\rho$ that are continuous and dominant. We prove generalised concentration bounds for these continuous and dominant risk functionals and show that a wide class of popular risk functionals belong to this class. Using our newly developed analytical toolkits, we analyse the algorithm $\rho$-MTS (for multinomial distributions) and prove that they admit asymptotically optimal regret bounds of risk-averse algorithms under the mean-variance, CVaR, and other ubiquitous risk measures. Numerical simulations show that the regret bounds incurred by our algorithms are reasonably tight vis-\`a-vis algorithm-independent lower bounds.
    Learning to Liquidate Forex: Optimal Stopping via Adaptive Top-K Regression. (arXiv:2202.12578v1 [cs.LG])
    We consider learning a trading agent acting on behalf of the treasury of a firm earning revenue in a foreign currency (FC) and incurring expenses in the home currency (HC). The goal of the agent is to maximize the expected HC at the end of the trading episode by deciding to hold or sell the FC at each time step in the trading episode. We pose this as an optimization problem, and consider a broad spectrum of approaches with the learning component ranging from supervised to imitation to reinforcement learning. We observe that most of the approaches considered struggle to improve upon simple heuristic baselines. We identify two key aspects of the problem that render standard solutions ineffective - i) while good forecasts of future FX rates can be highly effective in guiding good decisions, forecasting FX rates is difficult, and erroneous estimates tend to degrade the performance of trading agents instead of improving it, ii) the inherent non-stationary nature of FX rates renders a fixed decision-threshold highly ineffective. To address these problems, we propose a novel supervised learning approach that learns to forecast the top-K future FX rates instead of forecasting all the future FX rates, and bases the hold-versus-sell decision on the forecasts (e.g. hold if future FX rate is higher than current FX rate, sell otherwise). Furthermore, to handle the non-stationarity in the FX rates data which poses challenges to the i.i.d. assumption in supervised learning methods, we propose to adaptively learn decision-thresholds based on recent historical episodes. Through extensive empirical evaluation, we show that our approach is the only approach which is able to consistently improve upon a simple heuristic baseline. Further experiments show the inefficacy of state-of-the-art statistical and deep-learning-based forecasting methods as they degrade the performance of the trading agent.
    safe-control-gym: a Unified Benchmark Suite for Safe Learning-based Control and Reinforcement Learning. (arXiv:2109.06325v3 [cs.RO] UPDATED)
    In recent years, both reinforcement learning and learning-based control -- as well as the study of their safety, which is crucial for deployment in real-world robots -- have gained significant traction. However, to adequately gauge the progress and applicability of new results, we need the tools to equitably compare the approaches proposed by the controls and reinforcement learning communities. Here, we propose a new open-source benchmark suite, called safe-control-gym, supporting both model-based and data-based control techniques. We provide implementations for three dynamic systems -- the cart-pole, the 1D, and 2D quadrotor -- and two control tasks -- stabilization and trajectory tracking. We propose to extend OpenAI's Gym API -- the de facto standard in reinforcement learning research -- with (i) the ability to specify (and query) symbolic dynamics and (ii) constraints, and (iii) (repeatably) inject simulated disturbances in the control inputs, state measurements, and inertial properties. To demonstrate our proposal and in an attempt to bring research communities closer together, we show how to use safe-control-gym to quantitatively compare the control performance, data efficiency, and safety of multiple approaches from the fields of traditional control, learning-based control, and reinforcement learning.
    FedCAT: Towards Accurate Federated Learning via Device Concatenation. (arXiv:2202.12751v1 [cs.LG])
    As a promising distributed machine learning paradigm, Federated Learning (FL) enables all the involved devices to train a global model collaboratively without exposing their local data privacy. However, for non-IID scenarios, the classification accuracy of FL models decreases drastically due to the weight divergence caused by data heterogeneity. Although various FL variants have been studied to improve model accuracy, most of them still suffer from the problem of non-negligible communication and computation overhead. In this paper, we introduce a novel FL approach named Fed-Cat that can achieve high model accuracy based on our proposed device selection strategy and device concatenation-based local training method. Unlike conventional FL methods that aggregate local models trained on individual devices, FedCat periodically aggregates local models after their traversals through a series of logically concatenated devices, which can effectively alleviate the model weight divergence problem. Comprehensive experimental results on four well-known benchmarks show that our approach can significantly improve the model accuracy of state-of-the-art FL methods without causing extra communication overhead.
    The effect of fatigue on the performance of online writer recognition. (arXiv:2202.12694v1 [cs.CV])
    Background: The performance of biometric modalities based on things done by the subject, like signature and text-based recognition, may be affected by the subject state. Fatigue is one of the conditions that can significantly affect the outcome of handwriting tasks. Recent research has already shown that physical fatigue produces measurable differences in some features extracted from common writing and drawing tasks. It is important to establish to which extent physical fatigue contributes to the intra-person variability observed in these biometric modalities and also to know whether the performance of recognition methods is affected by fatigue. Goal: In this paper we assess the impact of fatigue on intra-user variability and on the performance of signature-based and text-based writer recognition approaches encompassing both identification and verification. Methods: Several signature and text recognition methods are considered and applied to samples gathered after different levels of induced fatigue, measured by metabolic and mechanical assessment and, also by subjective perception. The recognition methods are Dynamic Time Warping and Multi Section Vector Quantization, for signatures, and Allographic Text-Dependent Recognition for text in capital letters. For each fatigue level, the identification and verification performance of these methods is measured. Results: Signature shows no statistically significant intra-user impact, but text does. On the other hand, performance of signature-based recognition approaches is negatively impacted by fatigue whereas the impact is not noticeable in text-based recognition, provided long enough sequences are considered.
    Learning to Identify Perceptual Bugs in 3D Video Games. (arXiv:2202.12884v1 [cs.SE])
    Automated Bug Detection (ABD) in video games is composed of two distinct but complementary problems: automated game exploration and bug identification. Automated game exploration has received much recent attention, spurred on by developments in fields such as reinforcement learning. The complementary problem of identifying the bugs present in a player's experience has for the most part relied on the manual specification of rules. Although it is widely recognised that many bugs of interest cannot be identified with such methods, little progress has been made in this direction. In this work we show that it is possible to identify a range of perceptual bugs using learning-based methods by making use of only the rendered game screen as seen by the player. To support our work, we have developed World of Bugs (WOB) an open platform for testing ABD methods in 3D game environments.
    Learning Multi-Task Gaussian Process Over Heterogeneous Input Domains. (arXiv:2202.12636v1 [stat.ML])
    Multi-task Gaussian process (MTGP) is a well-known non-parametric Bayesian model for learning correlated tasks effectively by transferring knowledge across tasks. But current MTGP models are usually limited to the multi-task scenario defined in the same input domain, leaving no space for tackling the practical heterogeneous case, i.e., the features of input domains vary over tasks. To this end, this paper presents a novel heterogeneous stochastic variational linear model of coregionalization (HSVLMC) model for simultaneously learning the tasks with varied input domains. Particularly, we develop the stochastic variational framework with a Bayesian calibration method that (i) takes into account the effect of dimensionality reduction raised by domain mapping in order to achieve effective input alignment; and (ii) employs a residual modeling strategy to leverage the inductive bias brought by prior domain mappings for better model inference. Finally, the superiority of the proposed model against existing LMC models has been extensively verified on diverse heterogeneous multi-task cases.
    Towards Optimal Lower Bounds for k-median and k-means Coresets. (arXiv:2202.12793v1 [cs.DS])
    Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers, such that the sum of distances raised to the power of $z$ of every data point to its closest center is minimized. Special cases include the famous k-median problem ($z = 1$) and k-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern data analysis and massive data applications have given raise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative $(1 \pm \varepsilon)$ factor, hence reducing from large to small scale the input to the problem. In this paper, we present improved lower bounds for coresets in various metric spaces. In finite metrics consisting of $n$ points and doubling metrics with doubling constant $D$, we show that any coreset for $(k,z)$ clustering must consist of at least $\Omega(k \varepsilon^{-2} \log n)$ and $\Omega(k \varepsilon^{-2} D)$ points, respectively. Both bounds match previous upper bounds up to polylog factors. In Euclidean spaces, we show that any coreset for $(k,z)$ clustering must consists of at least $\Omega(k\varepsilon^{-2})$ points. We complement these lower bounds with a coreset construction consisting of at most $\tilde{O}(k\varepsilon^{-2}\cdot \min(\varepsilon^{-z},k))$ points.
    Deep Learning to advance the Eigenspace Perturbation Method for Turbulence Model Uncertainty Quantification. (arXiv:2202.12378v1 [cs.LG])
    The Reynolds Averaged Navier Stokes (RANS) models are the most common form of model in turbulence simulations. They are used to calculate Reynolds stress tensor and give robust results for engineering flows. But RANS model predictions have large error and uncertainty. In past, there has been some work towards using data-driven methods to increase their accuracy. In this work we outline a machine learning approach to aid the use of the Eigenspace Perturbation Method to predict the uncertainty in the turbulence model prediction. We use a trained neural network to predict the discrepancy in the shape of the RANS predicted Reynolds stress ellipsoid. We apply the model to a number of turbulent flows and demonstrate how the approach correctly identifies the regions in which modeling errors occur when compared to direct numerical simulation (DNS), large eddy simulation (LES) or experimental results from previous works.
    HTGN-BTW: Heterogeneous Temporal Graph Network with Bi-Time-Window Training Strategy for Temporal Link Prediction. (arXiv:2202.12713v1 [cs.SI])
    With the development of temporal networks such as E-commerce networks and social networks, the issue of temporal link prediction has attracted increasing attention in recent years. The Temporal Link Prediction task of WSDM Cup 2022 expects a single model that can work well on two kinds of temporal graphs simultaneously, which have quite different characteristics and data properties, to predict whether a link of a given type will occur between two given nodes within a given time span. Our team, named as nothing here, regards this task as a link prediction task in heterogeneous temporal networks and proposes a generic model, i.e., Heterogeneous Temporal Graph Network (HTGN), to solve such temporal link prediction task with the unfixed time intervals and the diverse link types. That is, HTGN can adapt to the heterogeneity of links and the prediction with unfixed time intervals within an arbitrary given time period. To train the model, we design a Bi-Time-Window training strategy (BTW) which has two kinds of mini-batches from two kinds of time windows. As a result, for the final test, we achieved an AUC of 0.662482 on dataset A, an AUC of 0.906923 on dataset B, and won 2nd place with an Average T-scores of 0.628942.
    Benchmark Assessment for DeepSpeed Optimization Library. (arXiv:2202.12831v1 [cs.LG])
    Deep Learning (DL) models are widely used in machine learning due to their performance and ability to deal with large datasets while producing high accuracy and performance metrics. The size of such datasets and the complexity of DL models cause such models to be complex, consuming large amount of resources and time to train. Many recent libraries and applications are introduced to deal with DL complexity and efficiency issues. In this paper, we evaluated one example, Microsoft DeepSpeed library through classification tasks. DeepSpeed public sources reported classification performance metrics on the LeNet architecture. We extended this through evaluating the library on several modern neural network architectures, including convolutional neural networks (CNNs) and Vision Transformer (ViT). Results indicated that DeepSpeed, while can make improvements in some of those cases, it has no or negative impact on others.
    MetaVA: Curriculum Meta-learning and Pre-fine-tuning of Deep Neural Networks for Detecting Ventricular Arrhythmias based on ECGs. (arXiv:2202.12450v1 [cs.LG])
    Ventricular arrhythmias (VA) are the main causes of sudden cardiac death. Developing machine learning methods for detecting VA based on electrocardiograms (ECGs) can help save people's lives. However, developing such machine learning models for ECGs is challenging because of the following: 1) group-level diversity from different subjects and 2) individual-level diversity from different moments of a single subject. In this study, we aim to solve these problems in the pre-training and fine-tuning stages. For the pre-training stage, we propose a novel model agnostic meta-learning (MAML) with curriculum learning (CL) method to solve group-level diversity. MAML is expected to better transfer the knowledge from a large dataset and use only a few recordings to quickly adapt the model to a new person. CL is supposed to further improve MAML by meta-learning from easy to difficult tasks. For the fine-tuning stage, we propose improved pre-fine-tuning to solve individual-level diversity. We conduct experiments using a combination of three publicly available ECG datasets. The results show that our method outperforms the compared methods in terms of all evaluation metrics. Ablation studies show that MAML and CL could help perform more evenly, and pre-fine-tuning could better fit the model to training data.
    Human-Centered Concept Explanations for Neural Networks. (arXiv:2202.12451v1 [cs.LG])
    Understanding complex machine learning models such as deep neural networks with explanations is crucial in various applications. Many explanations stem from the model perspective, and may not necessarily effectively communicate why the model is making its predictions at the right level of abstraction. For example, providing importance weights to individual pixels in an image can only express which parts of that particular image are important to the model, but humans may prefer an explanation which explains the prediction by concept-based thinking. In this work, we review the emerging area of concept based explanations. We start by introducing concept explanations including the class of Concept Activation Vectors (CAV) which characterize concepts using vectors in appropriate spaces of neural activations, and discuss different properties of useful concepts, and approaches to measure the usefulness of concept vectors. We then discuss approaches to automatically extract concepts, and approaches to address some of their caveats. Finally, we discuss some case studies that showcase the utility of such concept-based explanations in synthetic settings and real world applications.
    Construction of Large-Scale Misinformation Labeled Datasets from Social Media Discourse using Label Refinement. (arXiv:2202.12413v1 [cs.SI])
    Malicious accounts spreading misinformation has led to widespread false and misleading narratives in recent times, especially during the COVID-19 pandemic, and social media platforms struggle to eliminate these contents rapidly. This is because adapting to new domains requires human intensive fact-checking that is slow and difficult to scale. To address this challenge, we propose to leverage news-source credibility labels as weak labels for social media posts and propose model-guided refinement of labels to construct large-scale, diverse misinformation labeled datasets in new domains. The weak labels can be inaccurate at the article or social media post level where the stance of the user does not align with the news source or article credibility. We propose a framework to use a detection model self-trained on the initial weak labels with uncertainty sampling based on entropy in predictions of the model to identify potentially inaccurate labels and correct for them using self-supervision or relabeling. The framework will incorporate social context of the post in terms of the community of its associated user for surfacing inaccurate labels towards building a large-scale dataset with minimum human effort. To provide labeled datasets with distinction of misleading narratives where information might be missing significant context or has inaccurate ancillary details, the proposed framework will use the few labeled samples as class prototypes to separate high confidence samples into false, unproven, mixture, mostly false, mostly true, true, and debunk information. The approach is demonstrated for providing a large-scale misinformation dataset on COVID-19 vaccines.
    Context-Hierarchy Inverse Reinforcement Learning. (arXiv:2202.12597v1 [cs.AI])
    An inverse reinforcement learning (IRL) agent learns to act intelligently by observing expert demonstrations and learning the expert's underlying reward function. Although learning the reward functions from demonstrations has achieved great success in various tasks, several other challenges are mostly ignored. Firstly, existing IRL methods try to learn the reward function from scratch without relying on any prior knowledge. Secondly, traditional IRL methods assume the reward functions are homogeneous across all the demonstrations. Some existing IRL methods managed to extend to the heterogeneous demonstrations. However, they still assume one hidden variable that affects the behavior and learn the underlying hidden variable together with the reward from demonstrations. To solve these issues, we present Context Hierarchy IRL(CHIRL), a new IRL algorithm that exploits the context to scale up IRL and learn reward functions of complex behaviors. CHIRL models the context hierarchically as a directed acyclic graph; it represents the reward function as a corresponding modular deep neural network that associates each network module with a node of the context hierarchy. The context hierarchy and the modular reward representation enable data sharing across multiple contexts and state abstraction, significantly improving the learning performance. CHIRL has a natural connection with hierarchical task planning when the context hierarchy represents subtask decomposition. It enables to incorporate the prior knowledge of causal dependencies of subtasks and make it capable of solving large complex tasks by decoupling it into several subtasks and conquering each subtask to solve the original task. Experiments on benchmark tasks, including a large scale autonomous driving task in the CARLA simulator, show promising results in scaling up IRL for tasks with complex reward functions.
    Fine-Grained Prediction of Political Leaning on Social Media with Unsupervised Deep Learning. (arXiv:2202.12382v1 [cs.SI])
    Predicting the political leaning of social media users is an increasingly popular task, given its usefulness for electoral forecasts, opinion dynamics models and for studying the political dimension of polarization and disinformation. Here, we propose a novel unsupervised technique for learning fine-grained political leaning from the textual content of social media posts. Our technique leverages a deep neural network for learning latent political ideologies in a representation learning task. Then, users are projected in a low-dimensional ideology space where they are subsequently clustered. The political leaning of a user is automatically derived from the cluster to which the user is assigned. We evaluated our technique in two challenging classification tasks and we compared it to baselines and other state-of-the-art approaches. Our technique obtains the best results among all unsupervised techniques, with micro F1 = 0.426 in the 8-class task and micro F1 = 0.772 in the 3-class task. Other than being interesting on their own, our results also pave the way for the development of new and better unsupervised approaches for the detection of fine-grained political leaning.
    Long-Term Missing Value Imputation for Time Series Data Using Deep Neural Networks. (arXiv:2202.12441v1 [cs.LG])
    We present an approach that uses a deep learning model, in particular, a MultiLayer Perceptron (MLP), for estimating the missing values of a variable in multivariate time series data. We focus on filling a long continuous gap (e.g., multiple months of missing daily observations) rather than on individual randomly missing observations. Our proposed gap filling algorithm uses an automated method for determining the optimal MLP model architecture, thus allowing for optimal prediction performance for the given time series. We tested our approach by filling gaps of various lengths (three months to three years) in three environmental datasets with different time series characteristics, namely daily groundwater levels, daily soil moisture, and hourly Net Ecosystem Exchange. We compared the accuracy of the gap-filled values obtained with our approach to the widely-used R-based time series gap filling methods ImputeTS and mtsdi. The results indicate that using an MLP for filling a large gap leads to better results, especially when the data behave nonlinearly. Thus, our approach enables the use of datasets that have a large gap in one variable, which is common in many long-term environmental monitoring observations.
    Raman Spectrum Matching with Contrastive Representation Learning. (arXiv:2202.12549v1 [cs.LG])
    Raman spectroscopy is an effective, low-cost, non-intrusive technique often used for chemical identification. Typical approaches are based on matching observations to a reference database, which requires careful preprocessing, or supervised machine learning, which requires a fairly large number of training observations from each class. We propose a new machine learning technique for Raman spectrum matching, based on contrastive representation learning, that requires no preprocessing and works with as little as a single reference spectrum from each class. On three datasets we demonstrate that our approach significantly improves or is on par with the state of the art in prediction accuracy, and we show how to compute conformal prediction sets with specified frequentist coverage. Based on our findings, we believe contrastive representation learning is a promising alternative to existing methods for Raman spectrum matching.
    Self-organizing Democratized Learning: Towards Large-scale Distributed Learning Systems. (arXiv:2007.03278v2 [cs.LG] UPDATED)
    Emerging cross-device artificial intelligence (AI) applications require a transition from conventional centralized learning systems towards large-scale distributed AI systems that can collaboratively perform complex learning tasks. In this regard, democratized learning (Dem-AI) lays out a holistic philosophy with underlying principles for building large-scale distributed and democratized machine learning systems. The outlined principles are meant to study a generalization in distributed learning systems that goes beyond existing mechanisms such as federated learning. Moreover, such learning systems rely on hierarchical self-organization of well-connected distributed learning agents who have limited and highly personalized data and can evolve and regulate themselves based on the underlying duality of specialized and generalized processes. Inspired by Dem-AI philosophy, a novel distributed learning approach is proposed in this paper. The approach consists of a self-organizing hierarchical structuring mechanism based on agglomerative clustering, hierarchical generalization, and corresponding learning mechanism. Subsequently, hierarchical generalized learning problems in recursive forms are formulated and shown to be approximately solved using the solutions of distributed personalized learning problems and hierarchical update mechanisms. To that end, a distributed learning algorithm, namely DemLearn is proposed. Extensive experiments on benchmark MNIST, Fashion-MNIST, FE-MNIST, and CIFAR-10 datasets show that the proposed algorithms demonstrate better results in the generalization performance of learning models in agents compared to the conventional FL algorithms. The detailed analysis provides useful observations to further handle both the generalization and specialization performance of the learning models in Dem-AI systems.
    Bayesian autoencoders with uncertainty quantification: Towards trustworthy anomaly detection. (arXiv:2202.12653v1 [cs.LG])
    Despite numerous studies of deep autoencoders (AEs) for unsupervised anomaly detection, AEs still lack a way to express uncertainty in their predictions, crucial for ensuring safe and trustworthy machine learning systems in high-stake applications. Therefore, in this work, the formulation of Bayesian autoencoders (BAEs) is adopted to quantify the total anomaly uncertainty, comprising epistemic and aleatoric uncertainties. To evaluate the quality of uncertainty, we consider the task of classifying anomalies with the additional option of rejecting predictions of high uncertainty. In addition, we use the accuracy-rejection curve and propose the weighted average accuracy as a performance metric. Our experiments demonstrate the effectiveness of the BAE and total anomaly uncertainty on a set of benchmark datasets and two real datasets for manufacturing: one for condition monitoring, the other for quality inspection.
    Exploiting Problem Structure in Deep Declarative Networks: Two Case Studies. (arXiv:2202.12404v1 [cs.LG])
    Deep declarative networks and other recent related works have shown how to differentiate the solution map of a (continuous) parametrized optimization problem, opening up the possibility of embedding mathematical optimization problems into end-to-end learnable models. These differentiability results can lead to significant memory savings by providing an expression for computing the derivative without needing to unroll the steps of the forward-pass optimization procedure during the backward pass. However, the results typically require inverting a large Hessian matrix, which is computationally expensive when implemented naively. In this work we study two applications of deep declarative networks -- robust vector pooling and optimal transport -- and show how problem structure can be exploited to obtain very efficient backward pass computations in terms of both time and memory. Our ideas can be used as a guide for improving the computational performance of other novel deep declarative nodes.
    Novel techniques for improvement the NNetEn entropy calculation for short and noisy time series. (arXiv:2202.12703v1 [cs.LG])
    Entropy is a fundamental concept of information theory. It is widely used in the analysis of analog and digital signals. Conventional entropy measures have drawbacks, such as sensitivity to the length and amplitude of time series and low robustness to external noise. Recently, the NNetEn entropy measure has been introduced to overcome these problems. The NNetEn entropy uses a modified version of the LogNNet neural network classification model. The algorithm contains a reservoir matrix with N = 19625 elements, which the given time series should fill. Many practical time series have less than 19625 elements. Against this background, this paper investigates different duplicating and stretching techniques for filling to overcome this difficulty. The most successful technique is identified for practical applications. The presence of external noise and bias are other important issues affecting the efficiency of entropy measures. In order to perform meaningful analysis, three time series with different dynamics (chaotic, periodic, and binary), with a variation of signal-to-noise ratio (SNR) and offsets, are considered. It is shown that the error in the calculation of the NNetEn entropy does not exceed 10% when the SNR exceeds 30 dB. This opens the possibility of measuring the NNetEn of experimental signals in the presence of noise of various nature, white noise, or 1/f noise, without the need for noise filtering.
    ARIA: Adversarially Robust Image Attribution for Content Provenance. (arXiv:2202.12860v1 [cs.CV])
    Image attribution -- matching an image back to a trusted source -- is an emerging tool in the fight against online misinformation. Deep visual fingerprinting models have recently been explored for this purpose. However, they are not robust to tiny input perturbations known as adversarial examples. First we illustrate how to generate valid adversarial images that can easily cause incorrect image attribution. Then we describe an approach to prevent imperceptible adversarial attacks on deep visual fingerprinting models, via robust contrastive learning. The proposed training procedure leverages training on $\ell_\infty$-bounded adversarial examples, it is conceptually simple and incurs only a small computational overhead. The resulting models are substantially more robust, are accurate even on unperturbed images, and perform well even over a database with millions of images. In particular, we achieve 91.6% standard and 85.1% adversarial recall under $\ell_\infty$-bounded perturbations on manipulated images compared to 80.1% and 0.0% from prior work. We also show that robustness generalizes to other types of imperceptible perturbations unseen during training. Finally, we show how to train an adversarially robust image comparator model for detecting editorial changes in matched images.
    Data refinement for fully unsupervised visual inspection using pre-trained networks. (arXiv:2202.12759v1 [cs.CV])
    Anomaly detection has recently seen great progress in the field of visual inspection. More specifically, the use of classical outlier detection techniques on features extracted by deep pre-trained neural networks have been shown to deliver remarkable performances on the MVTec Anomaly Detection (MVTec AD) dataset. However, like most other anomaly detection strategies, these pre-trained methods assume all training data to be normal. As a consequence, they cannot be considered as fully unsupervised. There exists to our knowledge no work studying these pre-trained methods under fully unsupervised setting. In this work, we first assess the robustness of these pre-trained methods to fully unsupervised context, using polluted training sets (i.e. containing defective samples), and show that these methods are more robust to pollution compared to methods such as CutPaste. We then propose SROC, a Simple Refinement strategy for One Class classification. SROC enables to remove most of the polluted images from the training set, and to recover some of the lost AUC. We further show that our simple heuristic competes with, and even outperforms much more complex strategies from the existing literature.
    Physics solutions for machine learning privacy leaks. (arXiv:2202.12319v1 [cs.CR])
    Machine learning systems are becoming more and more ubiquitous in increasingly complex areas, including cutting-edge scientific research. The opposite is also true: the interest in better understanding the inner workings of machine learning systems motivates their analysis under the lens of different scientific disciplines. Physics is particularly successful in this, due to its ability to describe complex dynamical systems. While explanations of phenomena in machine learning based on physics are increasingly present, examples of direct application of notions akin to physics in order to improve machine learning systems are more scarce. Here we provide one such application in the problem of developing algorithms that preserve the privacy of the manipulated data, which is especially important in tasks such as the processing of medical records. We develop well-defined conditions to guarantee robustness to specific types of privacy leaks, and rigorously prove that such conditions are satisfied by tensor-network architectures. These are inspired by the efficient representation of quantum many-body systems, and have shown to compete and even surpass traditional machine learning architectures in certain cases. Given the growing expertise in training tensor-network architectures, these results imply that one may not have to be forced to make a choice between accuracy in prediction and ensuring the privacy of the information processed.
    Do autoencoders need a bottleneck for anomaly detection?. (arXiv:2202.12637v1 [cs.LG])
    A common belief in designing deep autoencoders (AEs), a type of unsupervised neural network, is that a bottleneck is required to prevent learning the identity function. Learning the identity function renders the AEs useless for anomaly detection. In this work, we challenge this limiting belief and investigate the value of non-bottlenecked AEs. The bottleneck can be removed in two ways: (1) overparameterising the latent layer, and (2) introducing skip connections. However, limited works have reported on the use of one of the ways. For the first time, we carry out extensive experiments covering various combinations of bottleneck removal schemes, types of AEs and datasets. In addition, we propose the infinitely-wide AEs as an extreme example of non-bottlenecked AEs. Their improvement over the baseline implies learning the identity function is not trivial as previously assumed. Moreover, we find that non-bottlenecked architectures (highest AUROC=0.857) can outperform their bottlenecked counterparts (highest AUROC=0.696) on the popular task of CIFAR (inliers) vs SVHN (anomalies), among other tasks, shedding light on the potential of developing non-bottlenecked AEs for improving anomaly detection.
    Handwriting Biometrics: Applications and Future Trends in e-Security and e-Health. (arXiv:2202.12760v1 [cs.CR])
    Background- This paper summarizes the state-of-the-art and applications based on online handwritting signals with special emphasis on e-security and e-health fields. Methods- In particular, we focus on the main achievements and challenges that should be addressed by the scientific community, providing a guide document for future research. Conclusions- Among all the points discussed in this article, we remark the importance of considering security, health, and metadata from a joint perspective. This is especially critical due to the double use possibilities of these behavioral signals.
    Directed Graph Auto-Encoders. (arXiv:2202.12449v1 [cs.LG])
    We introduce a new class of auto-encoders for directed graphs, motivated by a direct extension of the Weisfeiler-Leman algorithm to pairs of node labels. The proposed model learns pairs of interpretable latent representations for the nodes of directed graphs, and uses parameterized graph convolutional network (GCN) layers for its encoder and an asymmetric inner product decoder. Parameters in the encoder control the weighting of representations exchanged between neighboring nodes. We demonstrate the ability of the proposed model to learn meaningful latent embeddings and achieve superior performance on the directed link prediction task on several popular network datasets.
    Multi-View Graph Representation for Programming Language Processing: An Investigation into Algorithm Detection. (arXiv:2202.12481v1 [cs.LG])
    Program representation, which aims at converting program source code into vectors with automatically extracted features, is a fundamental problem in programming language processing (PLP). Recent work tries to represent programs with neural networks based on source code structures. However, such methods often focus on the syntax and consider only one single perspective of programs, limiting the representation power of models. This paper proposes a multi-view graph (MVG) program representation method. MVG pays more attention to code semantics and simultaneously includes both data flow and control flow as multiple views. These views are then combined and processed by a graph neural network (GNN) to obtain a comprehensive program representation that covers various aspects. We thoroughly evaluate our proposed MVG approach in the context of algorithm detection, an important and challenging subfield of PLP. Specifically, we use a public dataset POJ-104 and also construct a new challenging dataset ALG-109 to test our method. In experiments, MVG outperforms previous methods significantly, demonstrating our model's strong capability of representing source code.  ( 2 min )
    Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph. (arXiv:2202.12307v1 [cs.LG])
    This paper addresses the unsupervised learning of content-style decomposed representation. We first give a definition of style and then model the content-style representation as a token-level bipartite graph. An unsupervised framework, named Retriever, is proposed to learn such representations. First, a cross-attention module is employed to retrieve permutation invariant (P.I.) information, defined as style, from the input data. Second, a vector quantization (VQ) module is used, together with man-induced constraints, to produce interpretable content tokens. Last, an innovative link attention module serves as the decoder to reconstruct data from the decomposed content and style, with the help of the linking keys. Being modal-agnostic, the proposed Retriever is evaluated in both speech and image domains. The state-of-the-art zero-shot voice conversion performance confirms the disentangling ability of our framework. Top performance is also achieved in the part discovery task for images, verifying the interpretability of our representation. In addition, the vivid part-based style transfer quality demonstrates the potential of Retriever to support various fascinating generative tasks. Project page at https://ydcustc.github.io/retriever-demo/.  ( 2 min )
    Finite-Sum Compositional Stochastic Optimization: Theory and Applications. (arXiv:2202.12396v1 [math.OC])
    This paper studies stochastic optimization for a sum of compositional functions, where the inner-level function of each summand is coupled with the corresponding summation index. We refer to this family of problems as finite-sum coupled compositional optimization (FCCO). It has broad applications in machine learning for optimizing non-convex or convex compositional measures/objectives such as average precision (AP), $p$-norm push, listwise ranking losses, neighborhood component analysis (NCA), deep survival analysis, deep latent variable models, softmax functions, and model agnostic meta-learning, which deserves finer analysis. Yet, existing algorithms and analysis are restricted in one or other aspects. The contribution of this paper is to provide a comprehensive analysis of a simple stochastic algorithm for both non-convex and convex objectives. The key results are {\bf improved oracle complexities with the parallel speed-up} by the moving-average based stochastic estimator with mini-batching. Our theoretical analysis also exhibits new insights for improving the practical implementation by sampling the batches of equal size for the outer and inner levels. Numerical experiments on AP maximization and $p$-norm push optimization corroborate some aspects of the theory.  ( 2 min )
    Estimators of Entropy and Information via Inference in Probabilistic Models. (arXiv:2202.12363v1 [stat.ML])
    Estimating information-theoretic quantities such as entropy and mutual information is central to many problems in statistics and machine learning, but challenging in high dimensions. This paper presents estimators of entropy via inference (EEVI), which deliver upper and lower bounds on many information quantities for arbitrary variables in a probabilistic generative model. These estimators use importance sampling with proposal distribution families that include amortized variational inference and sequential Monte Carlo, which can be tailored to the target model and used to squeeze true information values with high accuracy. We present several theoretical properties of EEVI and demonstrate scalability and efficacy on two problems from the medical domain: (i) in an expert system for diagnosing liver disorders, we rank medical tests according to how informative they are about latent diseases, given a pattern of observed symptoms and patient attributes; and (ii) in a differential equation model of carbohydrate metabolism, we find optimal times to take blood glucose measurements that maximize information about a diabetic patient's insulin sensitivity, given their meal and medication schedule.  ( 2 min )
    Reachability analysis in stochastic directed graphs by reinforcement learning. (arXiv:2202.12546v1 [cs.AI])
    We characterize the reachability probabilities in stochastic directed graphs by means of reinforcement learning methods. In particular, we show that the dynamics of the transition probabilities in a stochastic digraph can be modeled via a difference inclusion, which, in turn, can be interpreted as a Markov decision process. Using the latter framework, we offer a methodology to design reward functions to provide upper and lower bounds on the reachability probabilities of a set of nodes for stochastic digraphs. The effectiveness of the proposed technique is demonstrated by application to the diffusion of epidemic diseases over time-varying contact networks generated by the proximity patterns of mobile agents.  ( 2 min )
    Learning Dynamic Mechanisms in Unknown Environments: A Reinforcement Learning Approach. (arXiv:2202.12797v1 [cs.LG])
    Dynamic mechanism design studies how mechanism designers should allocate resources among agents in a time-varying environment. We consider the problem where the agents interact with the mechanism designer according to an unknown Markov Decision Process (MDP), where agent rewards and the mechanism designer's state evolve according to an episodic MDP with unknown reward functions and transition kernels. We focus on the online setting with linear function approximation and attempt to recover the dynamic Vickrey-Clarke-Grove (VCG) mechanism over multiple rounds of interaction. A key contribution of our work is incorporating reward-free online Reinforcement Learning (RL) to aid exploration over a rich policy space to estimate prices in the dynamic VCG mechanism. We show that the regret of our proposed method is upper bounded by $\tilde{\mathcal{O}}(T^{2/3})$ and further devise a lower bound to show that our algorithm is efficient, incurring the same $\tilde{\mathcal{O}}(T^{2 / 3})$ regret as the lower bound, where $T$ is the total number of rounds. Our work establishes the regret guarantee for online RL in solving dynamic mechanism design problems without prior knowledge of the underlying model.  ( 2 min )
    Early Disease Stage Characterization in Parkinson's Disease from Resting-state fMRI Data Using a Long Short-term Memory Network. (arXiv:2202.12715v1 [q-bio.NC])
    Parkinson's disease (PD) is a common and complex neurodegenerative disorder with 5 stages in the Hoehn and Yahr scaling. Given the heterogeneity of PD, it is challenging to classify early stages 1 and 2 and detect brain function alterations. Functional magnetic resonance imaging (fMRI) is a promising tool in revealing functional connectivity (FC) differences and developing biomarkers in PD. Some machine learning approaches like support vector machine and logistic regression have been successfully applied in the early diagnosis of PD using fMRI data, which outperform classifiers based on manually selected morphological features. However, the early-stage characterization in FC changes has not been fully investigated. Given the complexity and non-linearity of fMRI data, we propose the use of a long short-term memory (LSTM) network to characterize the early stages of PD. The study included 84 subjects (56 in stage 2 and 28 in stage 1) from the Parkinson's Progression Markers Initiative (PPMI), the largest available public PD dataset. Under a repeated 10-fold stratified cross-validation, the LSTM model reached an accuracy of 71.63%, 13.52% higher than the best traditional machine learning method, indicating significantly better robustness and accuracy compared with other machine learning classifiers. We used the learned LSTM model weights to select the top brain regions that contributed to model prediction and performed FC analyses to characterize functional changes with disease stage and motor impairment to gain better insight into the brain mechanisms of PD.  ( 2 min )
    Learning ECG Representations based on Manipulated Temporal-Spatial Reverse Detection. (arXiv:2202.12458v1 [cs.LG])
    Learning representations from electrocardiogram (ECG) serves as a fundamental step for many downstream machine learning-based ECG analysis tasks. However, the learning process is always restricted by lack of high-quality labeled data in reality. Existing methods addressing data deficiency either cannot provide satisfied representations for downstream tasks or require too much effort to construct similar and dissimilar pairs to learn informative representations. In this paper, we propose a straightforward but effective approach to learn ECG representations. Inspired by the temporal and spatial characteristics of ECG, we flip the original signals horizontally, vertically, and both horizontally and vertically. The learning is then done by classifying the four types of signals including the original one. To verify the effectiveness of the proposed temporal-spatial (T-S) reverse detection method, we conduct a downstream task to detect atrial fibrillation (AF) which is one of the most common ECG tasks. The results show that the ECG representations learned with our method lead to remarkable performances on the downstream task. In addition, after exploring the representational feature space and investigating which parts of the ECG signal contribute to the representations, we conclude that the temporal reverse is more effective than the spatial reverse for learning ECG representations.  ( 2 min )
    Sparse Neural Additive Model: Interpretable Deep Learning with Feature Selection via Group Sparsity. (arXiv:2202.12482v1 [stat.ML])
    Interpretable machine learning has demonstrated impressive performance while preserving explainability. In particular, neural additive models (NAM) offer the interpretability to the black-box deep learning and achieve state-of-the-art accuracy among the large family of generalized additive models. In order to empower NAM with feature selection and improve the generalization, we propose the sparse neural additive models (SNAM) that employ the group sparsity regularization (e.g. Group LASSO), where each feature is learned by a sub-network whose trainable parameters are clustered as a group. We study the theoretical properties for SNAM with novel techniques to tackle the non-parametric truth, thus extending from classical sparse linear models such as the LASSO, which only works on the parametric truth. Specifically, we show that SNAM with subgradient and proximal gradient descents provably converges to zero training loss as $t\to\infty$, and that the estimation error of SNAM vanishes asymptotically as $n\to\infty$. We also prove that SNAM, similar to LASSO, can have exact support recovery, i.e. perfect feature selection, with appropriate regularization. Moreover, we show that the SNAM can generalize well and preserve the `identifiability', recovering each feature's effect. We validate our theories via extensive experiments and further testify to the good accuracy and efficiency of SNAM.  ( 2 min )
    Cutting Some Slack for SGD with Adaptive Polyak Stepsizes. (arXiv:2202.12328v1 [cs.LG])
    Tuning the step size of stochastic gradient descent is tedious and error prone. This has motivated the development of methods that automatically adapt the step size using readily available information. In this paper, we consider the family of SPS (Stochastic gradient with a Polyak Stepsize) adaptive methods. These are methods that make use of gradient and loss value at the sampled points to adaptively adjust the step size. We first show that SPS and its recent variants can all be seen as extensions of the Passive-Aggressive methods applied to nonlinear problems. We use this insight to develop new variants of the SPS method that are better suited to nonlinear models. Our new variants are based on introducing a slack variable into the interpolation equations. This single slack variable tracks the loss function across iterations and is used in setting a stable step size. We provide extensive numerical results supporting our new methods and a convergence theory.  ( 2 min )
    Bidding Agent Design in the LinkedIn Ad Marketplace. (arXiv:2202.12472v1 [cs.GT])
    We establish a general optimization framework for the design of automated bidding agent in dynamic online marketplaces. It optimizes solely for the buyer's interest and is agnostic to the auction mechanism imposed by the seller. As a result, the framework allows, for instance, the joint optimization of a group of ads across multiple platforms each running its own auction format. Bidding strategy derived from this framework automatically guarantees the optimality of budget allocation across ad units and platforms. Common constraints such as budget delivery schedule, return on investments and guaranteed results, directly translates to additional parameters in the bidding formula. We share practical learnings of the deployed bidding system in the LinkedIn ad marketplace based on this framework.  ( 2 min )
    Bayesian Deep Learning for Graphs. (arXiv:2202.12348v1 [cs.LG])
    The adaptive processing of structured data is a long-standing research topic in machine learning that investigates how to automatically learn a mapping from a structured input to outputs of various nature. Recently, there has been an increasing interest in the adaptive processing of graphs, which led to the development of different neural network-based methodologies. In this thesis, we take a different route and develop a Bayesian Deep Learning framework for graph learning. The dissertation begins with a review of the principles over which most of the methods in the field are built, followed by a study on graph classification reproducibility issues. We then proceed to bridge the basic ideas of deep learning for graphs with the Bayesian world, by building our deep architectures in an incremental fashion. This framework allows us to consider graphs with discrete and continuous edge features, producing unsupervised embeddings rich enough to reach the state of the art on several classification tasks. Our approach is also amenable to a Bayesian nonparametric extension that automatizes the choice of almost all model's hyper-parameters. Two real-world applications demonstrate the efficacy of deep learning for graphs. The first concerns the prediction of information-theoretic quantities for molecular simulations with supervised neural models. After that, we exploit our Bayesian models to solve a malware-classification task while being robust to intra-procedural code obfuscation techniques. We conclude the dissertation with an attempt to blend the best of the neural and Bayesian worlds together. The resulting hybrid model is able to predict multimodal distributions conditioned on input graphs, with the consequent ability to model stochasticity and uncertainty better than most works. Overall, we aim to provide a Bayesian perspective into the articulated research field of deep learning for graphs.  ( 2 min )
    Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance. (arXiv:2202.12387v1 [cs.LG])
    In this paper, we study contrastive learning from an optimization perspective, aiming to analyze and address a fundamental issue of existing contrastive learning methods that either rely on a large batch size or a large dictionary. We consider a global objective for contrastive learning, which contrasts each positive pair with all negative pairs for an anchor point. From the optimization perspective, we explain why existing methods such as SimCLR requires a large batch size in order to achieve a satisfactory result. In order to remove such requirement, we propose a memory-efficient Stochastic Optimization algorithm for solving the Global objective of Contrastive Learning of Representations, named SogCLR. We show that its optimization error is negligible under a reasonable condition after a sufficient number of iterations or is diminishing for a slightly different global contrastive objective. Empirically, we demonstrate that on ImageNet with a batch size 256, SogCLR achieves a performance of 69.4% for top-1 linear evaluation accuracy using ResNet-50, which is on par with SimCLR (69.3%) with a large batch size 8,192. We also attempt to show that the proposed optimization technique is generic and can be applied to solving other contrastive losses, e.g., two-way contrastive losses for bimodal contrastive learning.  ( 2 min )
    An Ensemble Approach for Patient Prognosis of Head and Neck Tumor Using Multimodal Data. (arXiv:2202.12537v1 [eess.IV])
    Accurate prognosis of a tumor can help doctors provide a proper course of treatment and, therefore, save the lives of many. Traditional machine learning algorithms have been eminently useful in crafting prognostic models in the last few decades. Recently, deep learning algorithms have shown significant improvement when developing diagnosis and prognosis solutions to different healthcare problems. However, most of these solutions rely solely on either imaging or clinical data. Utilizing patient tabular data such as demographics and patient medical history alongside imaging data in a multimodal approach to solve a prognosis task has started to gain more interest recently and has the potential to create more accurate solutions. The main issue when using clinical and imaging data to train a deep learning model is to decide on how to combine the information from these sources. We propose a multimodal network that ensembles deep multi-task logistic regression (MTLR), Cox proportional hazard (CoxPH) and CNN models to predict prognostic outcomes for patients with head and neck tumors using patients' clinical and imaging (CT and PET) data. Features from CT and PET scans are fused and then combined with patients' electronic health records for the prediction. The proposed model is trained and tested on 224 and 101 patient records respectively. Experimental results show that our proposed ensemble solution achieves a C-index of 0.72 on The HECKTOR test set that saved us the first place in prognosis task of the HECKTOR challenge. The full implementation based on PyTorch is available on \url{https://github.com/numanai/BioMedIA-Hecktor2021}.  ( 2 min )
    Machine Learning based refinement strategies for polyhedral grids with applications to Virtual Element and polyhedral Discontinuous Galerkin methods. (arXiv:2202.12654v1 [math.NA])
    We propose two new strategies based on Machine Learning techniques to handle polyhedral grid refinement, to be possibly employed within an adaptive framework. The first one employs the k-means clustering algorithm to partition the points of the polyhedron to be refined. This strategy is a variation of the well known Centroidal Voronoi Tessellation. The second one employs Convolutional Neural Networks to classify the "shape" of an element so that "ad-hoc" refinement criteria can be defined. This strategy can be used to enhance existing refinement strategies, including the k-means strategy, at a low online computational cost. We test the proposed algorithms considering two families of finite element methods that support arbitrarily shaped polyhedral elements, namely the Virtual Element Method (VEM) and the Polygonal Discontinuous Galerkin (PolyDG) method. We demonstrate that these strategies do preserve the structure and the quality of the underlaying grids, reducing the overall computational cost and mesh complexity.  ( 2 min )
    Towards an Accountable and Reproducible Federated Learning: A FactSheets Approach. (arXiv:2202.12443v1 [cs.AI])
    Federated Learning (FL) is a novel paradigm for the shared training of models based on decentralized and private data. With respect to ethical guidelines, FL is promising regarding privacy, but needs to excel vis-\`a-vis transparency and trustworthiness. In particular, FL has to address the accountability of the parties involved and their adherence to rules, law and principles. We introduce AF^2 Framework, where we instrument FL with accountability by fusing verifiable claims with tamper-evident facts, into reproducible arguments. We build on AI FactSheets for instilling transparency and trustworthiness into the AI lifecycle and expand it to incorporate dynamic and nested facts, as well as complex model compositions in FL. Based on our approach, an auditor can validate, reproduce and certify a FL process. This can be directly applied in practice to address the challenges of AI engineering and ethics.  ( 2 min )
    Learning to Combine Instructions in LLVM Compiler. (arXiv:2202.12379v1 [cs.LG])
    Instruction combiner (IC) is a critical compiler optimization pass, which replaces a sequence of instructions with an equivalent and optimized instruction sequence at basic block level. There can be thousands of instruction-combining patterns which need to be frequently updated as new coding idioms/applications and novel hardware evolve over time. This results in frequent updates to the IC optimization pass thereby incurring considerable human effort and high software maintenance costs. To mitigate these challenges associated with the traditional IC, we design and implement a Neural Instruction Combiner (NIC) and demonstrate its feasibility by integrating it into the standard LLVM compiler optimization pipeline. NIC leverages neural sequence-to-sequence (Seq2Seq) models for generating optimized encoded IR sequence from the unoptimized encoded IR sequence. To the best of our knowledge, ours is the first work demonstrating the feasibility of a neural instruction combiner built into a full-fledged compiler pipeline. Given the novelty of this task, we built a new dataset for training our NIC neural model. We show that NIC achieves exact match results percentage of 72% for optimized sequences as compared to traditional IC and neural machine translation metric Bleu precision score of 0.94, demonstrating its feasibility in a production compiler pipeline.  ( 2 min )
    GAME-ON: Graph Attention Network based Multimodal Fusion for Fake News Detection. (arXiv:2202.12478v1 [cs.MM])
    Social media in present times has a significant and growing influence. Fake news being spread on these platforms have a disruptive and damaging impact on our lives. Furthermore, as multimedia content improves the visibility of posts more than text data, it has been observed that often multimedia is being used for creating fake content. A plethora of previous multimodal-based work has tried to address the problem of modeling heterogeneous modalities in identifying fake content. However, these works have the following limitations: (1) inefficient encoding of inter-modal relations by utilizing a simple concatenation operator on the modalities at a later stage in a model, which might result in information loss; (2) training very deep neural networks with a disproportionate number of parameters on small but complex real-life multimodal datasets result in higher chances of overfitting. To address these limitations, we propose GAME-ON, a Graph Neural Network based end-to-end trainable framework that allows granular interactions within and across different modalities to learn more robust data representations for multimodal fake news detection. We use two publicly available fake news datasets, Twitter and Weibo, for evaluations. Our model outperforms on Twitter by an average of 11% and keeps competitive performance on Weibo, within a 2.6% margin, while using 65% fewer parameters than the best comparable state-of-the-art baseline.  ( 2 min )
    Towards Learning Causal Representations from Multi-Instance Bags. (arXiv:2202.12570v1 [cs.LG])
    Although humans can easily identify the object of interest from groups of examples using group-level labels, most of the existing machine learning algorithms can only learn from individually labeled examples. Multi-instance learning (MIL) is a type of weakly supervised learning that deals with objects represented as groups of instances, and is theoretically capable of predicting instance labels from group-level supervision. Unfortunately, most existing MIL algorithms focus on improving the performances of group label predictions and cannot be used to accurately predict instance labels. In this work, we propose the TargetedMIL algorithm, which learns semantically meaningful representations that can be interpreted as causal to the object of interest. Utilizing the inferred representations, TargetedMIL excels at instance label predictions from group-level labels. Qualitative and quantitative evaluations on various datasets demonstrate the effectiveness of TargetedMIL.  ( 2 min )
    AutoIP: A United Framework to Integrate Physics into Gaussian Processes. (arXiv:2202.12316v1 [cs.LG])
    Physics modeling is critical for modern science and engineering applications. From data science perspective, physics knowledge -- often expressed as differential equations -- is valuable in that it is highly complementary to data, and can potentially help overcome data sparsity, noise, inaccuracy, etc. In this work, we propose a simple yet powerful framework that can integrate all kinds of differential equations into Gaussian processes (GPs) to enhance prediction accuracy and uncertainty quantification. These equations can be linear, nonlinear, temporal, time-spatial, complete, incomplete with unknown source terms, etc. Specifically, based on kernel differentiation, we construct a GP prior to jointly sample the values of the target function, equation-related derivatives, and latent source functions from a multivariate Gaussian distribution. The sampled values are fed to two likelihoods -- one is to fit the observations and the other to conform to the equation. We use the whitening trick to evade the strong dependency between the sampled function values and kernel parameters, and develop a stochastic variational learning algorithm. Our method shows improvement upon vanilla GPs in both simulation and several real-world applications, even using rough, incomplete equations.  ( 2 min )
    Domain Adaptation: the Key Enabler of Neural Network Equalizers in Coherent Optical Systems. (arXiv:2202.12689v1 [eess.SP])
    We introduce the domain adaptation and randomization approach for calibrating neural network-based equalizers for real transmissions, using synthetic data. The approach renders up to 99\% training process reduction, which we demonstrate in three experimental setups.  ( 2 min )
    State-of-the-art in speaker recognition. (arXiv:2202.12705v1 [eess.AS])
    Recent advances in speech technologies have produced new tools that can be used to improve the performance and flexibility of speaker recognition While there are few degrees of freedom or alternative methods when using fingerprint or iris identification techniques, speech offers much more flexibility and different levels for performing recognition: the system can force the user to speak in a particular manner, different for each attempt to enter. Also with voice input the system has other degrees of freedom, such as the use of knowledge/codes that only the user knows, or dialectical/semantical traits that are difficult to forge. This paper offers and overview of the state of the art in speaker recognition, with special emphasis on the pros and contras, and the current research lines. The current research lines include improved classification systems, and the use of high level information by means of probabilistic grammars. In conclusion, speaker recognition is far away from being a technology where all the possibilities have already been explored.  ( 2 min )
    Highly-Efficient Binary Neural Networks for Visual Place Recognition. (arXiv:2202.12375v1 [cs.CV])
    VPR is a fundamental task for autonomous navigation as it enables a robot to localize itself in the workspace when a known location is detected. Although accuracy is an essential requirement for a VPR technique, computational and energy efficiency are not less important for real-world applications. CNN-based techniques archive state-of-the-art VPR performance but are computationally intensive and energy demanding. Binary neural networks (BNN) have been recently proposed to address VPR efficiently. Although a typical BNN is an order of magnitude more efficient than a CNN, its processing time and energy usage can be further improved. In a typical BNN, the first convolution is not completely binarized for the sake of accuracy. Consequently, the first layer is the slowest network stage, requiring a large share of the entire computational effort. This paper presents a class of BNNs for VPR that combines depthwise separable factorization and binarization to replace the first convolutional layer to improve computational and energy efficiency. Our best model achieves state-of-the-art VPR performance while spending considerably less time and energy to process an image than a BNN using a non-binary convolution as a first stage.  ( 2 min )
    Addressing Over-Smoothing in Graph Neural Networks via Deep Supervision. (arXiv:2202.12508v1 [cs.LG])
    Learning useful node and graph representations with graph neural networks (GNNs) is a challenging task. It is known that deep GNNs suffer from over-smoothing where, as the number of layers increases, node representations become nearly indistinguishable and model performance on the downstream task degrades significantly. To address this problem, we propose deeply-supervised GNNs (DSGNNs), i.e., GNNs enhanced with deep supervision where representations learned at all layers are used for training. We show empirically that DSGNNs are resilient to over-smoothing and can outperform competitive benchmarks on node and graph property prediction problems.  ( 2 min )
    Deep Dirichlet uncertainty for unsupervised out-of-distribution detection of eye fundus photographs in glaucoma screening. (arXiv:2202.12634v1 [cs.CV])
    The development of automatic tools for early glaucoma diagnosis with color fundus photographs can significantly reduce the impact of this disease. However, current state-of-the-art solutions are not robust to real-world scenarios, providing over-confident predictions for out-of-distribution cases. With this in mind, we propose a model based on the Dirichlet distribution that allows to obtain class-wise probabilities together with an uncertainty estimation without exposure to out-of-distribution cases. We demonstrate our approach on the AIROGS challenge. At the start of the final test phase (8 Feb. 2022), our method had the highest average score among all submissions.  ( 2 min )
    Standard Deviation-Based Quantization for Deep Neural Networks. (arXiv:2202.12422v1 [cs.LG])
    Quantization of deep neural networks is a promising approach that reduces the inference cost, making it feasible to run deep networks on resource-restricted devices. Inspired by existing methods, we propose a new framework to learn the quantization intervals (discrete values) using the knowledge of the network's weight and activation distributions, i.e., standard deviation. Furthermore, we propose a novel base-2 logarithmic quantization scheme to quantize weights to power-of-two discrete values. Our proposed scheme allows us to replace resource-hungry high-precision multipliers with simple shift-add operations. According to our evaluations, our method outperforms existing work on CIFAR10 and ImageNet datasets and even achieves better accuracy performance with 3-bit weights and activations when compared to the full-precision models. Moreover, our scheme simultaneously prunes the network's parameters and allows us to flexibly adjust the pruning ratio during the quantization process.  ( 2 min )
    Learning Transferable Reward for Query Object Localization with Policy Adaptation. (arXiv:2202.12403v1 [cs.CV])
    We propose a reinforcement learning based approach to \emph{query object localization}, for which an agent is trained to localize objects of interest specified by a small exemplary set. We learn a transferable reward signal formulated using the exemplary set by ordinal metric learning. Our proposed method enables test-time policy adaptation to new environments where the reward signals are not readily available, and outperforms fine-tuning approaches that are limited to annotated images. In addition, the transferable reward allows repurposing the trained agent from one specific class to another class. Experiments on corrupted MNIST, CU-Birds, and COCO datasets demonstrate the effectiveness of our approach.  ( 2 min )
    The rise of the lottery heroes: why zero-shot pruning is hard. (arXiv:2202.12400v1 [cs.LG])
    Recent advances in deep learning optimization showed that just a subset of parameters are really necessary to successfully train a model. Potentially, such a discovery has broad impact from the theory to application; however, it is known that finding these trainable sub-network is a typically costly process. This inhibits practical applications: can the learned sub-graph structures in deep learning models be found at training time? In this work we explore such a possibility, observing and motivating why common approaches typically fail in the extreme scenarios of interest, and proposing an approach which potentially enables training with reduced computational effort. The experiments on either challenging architectures and datasets suggest the algorithmic accessibility over such a computational gain, and in particular a trade-off between accuracy achieved and training complexity deployed emerges.  ( 2 min )
    Learn From the Past: Experience Ensemble Knowledge Distillation. (arXiv:2202.12488v1 [cs.CV])
    Traditional knowledge distillation transfers "dark knowledge" of a pre-trained teacher network to a student network, and ignores the knowledge in the training process of the teacher, which we call teacher's experience. However, in realistic educational scenarios, learning experience is often more important than learning results. In this work, we propose a novel knowledge distillation method by integrating the teacher's experience for knowledge transfer, named experience ensemble knowledge distillation (EEKD). We save a moderate number of intermediate models from the training process of the teacher model uniformly, and then integrate the knowledge of these intermediate models by ensemble technique. A self-attention module is used to adaptively assign weights to different intermediate models in the process of knowledge transfer. Three principles of constructing EEKD on the quality, weights and number of intermediate models are explored. A surprising conclusion is found that strong ensemble teachers do not necessarily produce strong students. The experimental results on CIFAR-100 and ImageNet show that EEKD outperforms the mainstream knowledge distillation methods and achieves the state-of-the-art. In particular, EEKD even surpasses the standard ensemble distillation on the premise of saving training cost.  ( 2 min )
  • Open

    A Robust Multi-Objective Bayesian Optimization Framework Considering Input Uncertainty. (arXiv:2202.12848v1 [cs.AI])
    Bayesian optimization is a popular tool for data-efficient optimization of expensive objective functions. In real-life applications like engineering design, the designer often wants to take multiple objectives as well as input uncertainty into account to find a set of robust solutions. While this is an active topic in single-objective Bayesian optimization, it is less investigated in the multi-objective case. We introduce a novel Bayesian optimization framework to efficiently perform multi-objective optimization considering input uncertainty. We propose a robust Gaussian Process model to infer the Bayes risk criterion to quantify robustness, and we develop a two-stage Bayesian optimization process to search for a robust Pareto frontier. The complete framework supports various distributions of the input uncertainty and takes full advantage of parallel computing. We demonstrate the effectiveness of the framework through numerical benchmarks.
    A function space analysis of finite neural networks with insights from sampling theory. (arXiv:2004.06989v2 [cs.LG] UPDATED)
    This work suggests using sampling theory to analyze the function space represented by neural networks. First, it shows, under the assumption of a finite input domain, which is the common case in training neural networks, that the function space generated by multi-layer networks with non-expansive activation functions is smooth. This extends over previous works that show results for the case of infinite width ReLU networks. Then, under the assumption that the input is band-limited, we provide novel error bounds for univariate neural networks. We analyze both deterministic uniform and random sampling showing the advantage of the former.
    Self-organizing Democratized Learning: Towards Large-scale Distributed Learning Systems. (arXiv:2007.03278v2 [cs.LG] UPDATED)
    Emerging cross-device artificial intelligence (AI) applications require a transition from conventional centralized learning systems towards large-scale distributed AI systems that can collaboratively perform complex learning tasks. In this regard, democratized learning (Dem-AI) lays out a holistic philosophy with underlying principles for building large-scale distributed and democratized machine learning systems. The outlined principles are meant to study a generalization in distributed learning systems that goes beyond existing mechanisms such as federated learning. Moreover, such learning systems rely on hierarchical self-organization of well-connected distributed learning agents who have limited and highly personalized data and can evolve and regulate themselves based on the underlying duality of specialized and generalized processes. Inspired by Dem-AI philosophy, a novel distributed learning approach is proposed in this paper. The approach consists of a self-organizing hierarchical structuring mechanism based on agglomerative clustering, hierarchical generalization, and corresponding learning mechanism. Subsequently, hierarchical generalized learning problems in recursive forms are formulated and shown to be approximately solved using the solutions of distributed personalized learning problems and hierarchical update mechanisms. To that end, a distributed learning algorithm, namely DemLearn is proposed. Extensive experiments on benchmark MNIST, Fashion-MNIST, FE-MNIST, and CIFAR-10 datasets show that the proposed algorithms demonstrate better results in the generalization performance of learning models in agents compared to the conventional FL algorithms. The detailed analysis provides useful observations to further handle both the generalization and specialization performance of the learning models in Dem-AI systems.
    Long Expressive Memory for Sequence Modeling. (arXiv:2110.04744v2 [cs.LG] UPDATED)
    We propose a novel method called Long Expressive Memory (LEM) for learning long-term sequential dependencies. LEM is gradient-based, it can efficiently process sequential tasks with very long-term dependencies, and it is sufficiently expressive to be able to learn complicated input-output maps. To derive LEM, we consider a system of multiscale ordinary differential equations, as well as a suitable time-discretization of this system. For LEM, we derive rigorous bounds to show the mitigation of the exploding and vanishing gradients problem, a well-known challenge for gradient-based recurrent sequential learning methods. We also prove that LEM can approximate a large class of dynamical systems to high accuracy. Our empirical results, ranging from image and time-series classification through dynamical systems prediction to speech recognition and language modeling, demonstrate that LEM outperforms state-of-the-art recurrent neural networks, gated recurrent units, and long short-term memory models.
    Heavy-tailed Streaming Statistical Estimation. (arXiv:2108.11483v2 [cs.LG] UPDATED)
    We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples. This could also be viewed as stochastic optimization under heavy-tailed distributions, with an additional $O(p)$ space complexity constraint. We design a clipped stochastic gradient descent algorithm and provide an improved analysis, under a more nuanced condition on the noise of the stochastic gradients, which we show is critical when analyzing stochastic optimization problems arising from general statistical estimation problems. Our results guarantee convergence not just in expectation but with exponential concentration, and moreover does so using $O(1)$ batch size. We provide consequences of our results for mean estimation and linear regression. Finally, we provide empirical corroboration of our results and algorithms via synthetic experiments for mean estimation and linear regression.
    Convex Analysis of the Mean Field Langevin Dynamics. (arXiv:2201.10469v2 [stat.ML] UPDATED)
    As an example of the nonlinear Fokker-Planck equation, the mean field Langevin dynamics recently attracts attention due to its connection to (noisy) gradient descent on infinitely wide neural networks in the mean field regime, and hence the convergence property of the dynamics is of great theoretical interest. In this work, we give a concise and self-contained convergence rate analysis of the mean field Langevin dynamics with respect to the (regularized) objective function in both continuous and discrete time settings. The key ingredient of our proof is a proximal Gibbs distribution $p_q$ associated with the dynamics, which, in combination with techniques in [Vempala and Wibisono (2019)], allows us to develop a simple convergence theory parallel to classical results in convex optimization. Furthermore, we reveal that $p_q$ connects to the duality gap in the empirical risk minimization setting, which enables efficient empirical evaluation of the algorithm convergence.
    Learning Dynamic Mechanisms in Unknown Environments: A Reinforcement Learning Approach. (arXiv:2202.12797v1 [cs.LG])
    Dynamic mechanism design studies how mechanism designers should allocate resources among agents in a time-varying environment. We consider the problem where the agents interact with the mechanism designer according to an unknown Markov Decision Process (MDP), where agent rewards and the mechanism designer's state evolve according to an episodic MDP with unknown reward functions and transition kernels. We focus on the online setting with linear function approximation and attempt to recover the dynamic Vickrey-Clarke-Grove (VCG) mechanism over multiple rounds of interaction. A key contribution of our work is incorporating reward-free online Reinforcement Learning (RL) to aid exploration over a rich policy space to estimate prices in the dynamic VCG mechanism. We show that the regret of our proposed method is upper bounded by $\tilde{\mathcal{O}}(T^{2/3})$ and further devise a lower bound to show that our algorithm is efficient, incurring the same $\tilde{\mathcal{O}}(T^{2 / 3})$ regret as the lower bound, where $T$ is the total number of rounds. Our work establishes the regret guarantee for online RL in solving dynamic mechanism design problems without prior knowledge of the underlying model.
    Gen\'eLive! Generating Rhythm Actions in Love Live!. (arXiv:2202.12823v1 [cs.LG])
    A rhythm action game is a music-based video game in which the player is challenged to issue commands at the right timings during a music session. The timings are rendered in the chart, which consists of visual symbols, called notes, flying through the screen. KLab Inc., a Japan-based video game developer, has operated rhythm action games including a title for the "Love Live!" franchise, which became a hit across Asia and beyond. Before this work, the company generated the charts manually, which resulted in a costly business operation. This paper presents how KLab applied a deep generative model for synthesizing charts, and shows how it has improved the chart production process, reducing the business cost by half. Existing generative models generated poor quality charts for easier difficulty modes. We report how we overcame this challenge through a multi-scaling model dedicated to rhythm actions, by considering beats among other things. Our model, named Gen\'eLive!, is evaluated using production datasets at KLab as well as open datasets.  ( 2 min )
    Causal discovery for observational sciences using supervised machine learning. (arXiv:2202.12813v1 [stat.ME])
    Causal inference can estimate causal effects, but unless data are collected experimentally, statistical analyses must rely on pre-specified causal models. Causal discovery algorithms are empirical methods for constructing such causal models from data. Several asymptotically correct methods already exist, but they generally struggle on smaller samples. Moreover, most methods focus on very sparse causal models, which may not always be a realistic representation of real-life data generating mechanisms. Finally, while causal relationships suggested by the methods often hold true, their claims about causal non-relatedness have high error rates. This non-conservative error tradeoff is not ideal for observational sciences, where the resulting model is directly used to inform causal inference: A causal model with many missing causal relations entails too strong assumptions and may lead to biased effect estimates. We propose a new causal discovery method that addresses these three shortcomings: Supervised learning discovery (SLdisco). SLdisco uses supervised machine learning to obtain a mapping from observational data to equivalence classes of causal models. We evaluate SLdisco in a large simulation study based on Gaussian data and we consider several choices of model size and sample size. We find that SLdisco is more conservative, only moderately less informative and less sensitive towards sample size than existing procedures. We furthermore provide a real epidemiological data application. We use random subsampling to investigate real data performance on small samples and again find that SLdisco is less sensitive towards sample size and hence seems to better utilize the information available in small datasets.  ( 2 min )
    Obtaining Causal Information by Merging Datasets with MAXENT. (arXiv:2107.07640v2 [stat.ME] UPDATED)
    The investigation of the question "which treatment has a causal effect on a target variable?" is of particular relevance in a large number of scientific disciplines. This challenging task becomes even more difficult if not all treatment variables were or even cannot be observed jointly with the target variable. Another similarly important and challenging task is to quantify the causal influence of a treatment on a target in the presence of confounders. In this paper, we discuss how causal knowledge can be obtained without having observed all variables jointly, but by merging the statistical information from different datasets. We show how the maximum entropy principle can be used to identify edges among random variables when assuming causal sufficiency and an extended version of faithfulness, and when only subsets of the variables have been observed jointly.  ( 2 min )
    Limit Distribution Theory for the Smooth 1-Wasserstein Distance with Applications. (arXiv:2107.13494v5 [math.ST] UPDATED)
    The smooth 1-Wasserstein distance (SWD) $W_1^\sigma$ was recently proposed as a means to mitigate the curse of dimensionality in empirical approximation while preserving the Wasserstein structure. Indeed, SWD exhibits parametric convergence rates and inherits the metric and topological structure of the classic Wasserstein distance. Motivated by the above, this work conducts a thorough statistical study of the SWD, including a high-dimensional limit distribution result for empirical $W_1^\sigma$, bootstrap consistency, concentration inequalities, and Berry-Esseen type bounds. The derived nondegenerate limit stands in sharp contrast with the classic empirical $W_1$, for which a similar result is known only in the one-dimensional case. We also explore asymptotics and characterize the limit distribution when the smoothing parameter $\sigma$ is scaled with $n$, converging to $0$ at a sufficiently slow rate. The dimensionality of the sampled distribution enters empirical SWD convergence bounds only through the prefactor (i.e., the constant). We provide a sharp characterization of this prefactor's dependence on the smoothing parameter and the intrinsic dimension. This result is then used to derive new empirical convergence rates for classic $W_1$ in terms of the intrinsic dimension. As applications of the limit distribution theory, we study two-sample testing and minimum distance estimation (MDE) under $W_1^\sigma$. We establish asymptotic validity of SWD testing, while for MDE, we prove measurability, almost sure convergence, and limit distributions for optimal estimators and their corresponding $W_1^\sigma$ error. Our results suggest that the SWD is well suited for high-dimensional statistical learning and inference.  ( 2 min )
    Benchmarking Generative Latent Variable Models for Speech. (arXiv:2202.12707v1 [eess.AS])
    Stochastic latent variable models (LVMs) achieve state-of-the-art performance on natural image generation but are still inferior to deterministic models on speech. In this paper, we develop a speech benchmark of popular temporal LVMs and compare them against state-of-the-art deterministic models. We report the likelihood, which is a much used metric in the image domain, but rarely, and often incomparably, reported for speech models. To assess the quality of the learned representations, we also compare their usefulness for phoneme recognition. Finally, we adapt the Clockwork VAE, a state-of-the-art temporal LVM for video generation, to the speech domain. Despite being autoregressive only in latent space, we find that the Clockwork VAE can outperform previous LVMs and reduce the gap to deterministic models by using a hierarchy of latent variables.  ( 2 min )
    Private and polynomial time algorithms for learning Gaussians and beyond. (arXiv:2111.11320v2 [stat.ML] UPDATED)
    We present a fairly general framework for reducing $(\varepsilon, \delta)$ differentially private (DP) statistical estimation to its non-private counterpart. As the main application of this framework, we give a polynomial time and $(\varepsilon,\delta)$-DP algorithm for learning (unrestricted) Gaussian distributions in $\mathbb{R}^d$. The sample complexity of our approach for learning the Gaussian up to total variation distance $\alpha$ is $\widetilde{O}(d^2/\alpha^2 + d^2\sqrt{\ln(1/\delta)}/\alpha \varepsilon + d\ln(1/\delta) / \alpha \varepsilon)$ matching (up to logarithmic factors) the best known information-theoretic (non-efficient) sample complexity upper bound due to Aden-Ali, Ashtiani, and Kamath ~(ALT'21). In an independent work, Kamath, Mouzakis, Singhal, Steinke, and Ullman~(arXiv:2111.04609) proved a similar result using a different approach and with $O(d^{5/2})$ sample complexity dependence on $d$. As another application of our framework, we provide the first polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of (unrestricted) Gaussians with sample complexity $\widetilde{O}(d^{3.5})$. In another independent work, Kothari, Manurangsi, and Velingker~(arXiv:2112.03548) also provided a polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of Gaussians with sample complexity $\widetilde{O}(d^8)$.  ( 2 min )
    Meta-Learning for Simple Regret Minimization. (arXiv:2202.12888v1 [cs.LG])
    We develop a meta-learning framework for simple regret minimization in bandits. In this framework, a learning agent interacts with a sequence of bandit tasks, which are sampled i.i.d.\ from an unknown prior distribution, and learns its meta-parameters to perform better on future tasks. We propose the first Bayesian and frequentist algorithms for this meta-learning problem. The Bayesian algorithm has access to a prior distribution over the meta-parameters and its meta simple regret over $m$ bandit tasks with horizon $n$ is mere $\tilde{O}(m / \sqrt{n})$. This is while we show that the meta simple regret of the frequentist algorithm is $\tilde{O}(\sqrt{m} n + m/ \sqrt{n})$, and thus, worse. However, the algorithm is more general, because it does not need a prior distribution over the meta-parameters, and is easier to implement for various distributions. We instantiate our algorithms for several classes of bandit problems. Our algorithms are general and we complement our theory by evaluating them empirically in several environments.  ( 2 min )
    Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects. (arXiv:2202.12891v1 [stat.ML])
    Estimating heterogeneous treatment effects is an important problem across many domains. In order to accurately estimate such treatment effects, one typically relies on data from observational studies or randomized experiments. Currently, most existing works rely exclusively on observational data, which is often confounded and, hence, yields biased estimates. While observational data is confounded, randomized data is unconfounded, but its sample size is usually too small to learn heterogeneous treatment effects. In this paper, we propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data via representation learning. In particular, we introduce a two-step framework: first, we use observational data to learn a shared structure (in form of a representation); and then, we use randomized data to learn the data-specific structures. We analyze the finite sample properties of our framework and compare them to several natural baselines. As such, we derive conditions for when combining observational and randomized data is beneficial, and for when it is not. Based on this, we introduce a sample-efficient algorithm, called CorNet. We use extensive simulation studies to verify the theoretical properties of CorNet and multiple real-world datasets to demonstrate our method's superiority compared to existing methods.  ( 2 min )
    Optimal distributed composite testing in high-dimensional Gaussian models with 1-bit communication. (arXiv:2012.04957v2 [cs.IT] UPDATED)
    In this paper we study the problem of signal detection in Gaussian noise in a distributed setting where the local machines in the star topology can communicate a single bit of information. We derive a lower bound on the Euclidian norm that the signal needs to have in order to be detectable. Moreover, we exhibit optimal distributed testing strategies that attain the lower bound.  ( 2 min )
    Efficient Neural Causal Discovery without Acyclicity Constraints. (arXiv:2107.10483v3 [cs.LG] UPDATED)
    Learning the structure of a causal graphical model using both observational and interventional data is a fundamental problem in many scientific fields. A promising direction is continuous optimization for score-based methods, which, however, require constrained optimization to enforce acyclicity or lack convergence guarantees. In this paper, we present ENCO, an efficient structure learning method for directed, acyclic causal graphs leveraging observational and interventional data. ENCO formulates the graph search as an optimization of independent edge likelihoods, with the edge orientation being modeled as a separate parameter. Consequently, we can provide convergence guarantees of ENCO under mild conditions without constraining the score function with respect to acyclicity. In experiments, we show that ENCO can efficiently recover graphs with hundreds of nodes, an order of magnitude larger than what was previously possible, while handling deterministic variables and latent confounders.  ( 2 min )
    Exploratory Hidden Markov Factor Models for Longitudinal Mobile Health Data: Application to Adverse Posttraumatic Neuropsychiatric Sequelae. (arXiv:2202.12819v1 [stat.AP])
    Adverse posttraumatic neuropsychiatric sequelae (APNS) are common among veterans and millions of Americans after traumatic events and cause tremendous burdens for trauma survivors and society. Many studies have been conducted to investigate the challenges in diagnosing and treating APNS symptoms. However, progress has been limited by the subjective nature of traditional measures. This study is motivated by the objective mobile device data collected from the Advancing Understanding of RecOvery afteR traumA (AURORA) study. We develop both discrete-time and continuous-time exploratory hidden Markov factor models to model the dynamic psychological conditions of individuals with either regular or irregular measurements. The proposed models extend the conventional hidden Markov models to allow high-dimensional data and feature-based nonhomogeneous transition probability between hidden psychological states. To find the maximum likelihood estimates, we develop a Stabilized Expectation-Maximization algorithm with Initialization Strategies (SEMIS). Simulation studies with synthetic data are carried out to assess the performance of parameter estimation and model selection. Finally, an application to the AURORA data is conducted, which captures the relationships between heart rate variability, activity, and APNS consistent with existing literature.  ( 2 min )
    High-Dimensional Sparse Bayesian Learning without Covariance Matrices. (arXiv:2202.12808v1 [eess.SP])
    Sparse Bayesian learning (SBL) is a powerful framework for tackling the sparse coding problem. However, the most popular inference algorithms for SBL become too expensive for high-dimensional settings, due to the need to store and compute a large covariance matrix. We introduce a new inference scheme that avoids explicit construction of the covariance matrix by solving multiple linear systems in parallel to obtain the posterior moments for SBL. Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm. On several simulations, our method scales better than existing approaches in computation time and memory, especially for structured dictionaries capable of fast matrix-vector multiplication.  ( 2 min )
    Bayesian autoencoders with uncertainty quantification: Towards trustworthy anomaly detection. (arXiv:2202.12653v1 [cs.LG])
    Despite numerous studies of deep autoencoders (AEs) for unsupervised anomaly detection, AEs still lack a way to express uncertainty in their predictions, crucial for ensuring safe and trustworthy machine learning systems in high-stake applications. Therefore, in this work, the formulation of Bayesian autoencoders (BAEs) is adopted to quantify the total anomaly uncertainty, comprising epistemic and aleatoric uncertainties. To evaluate the quality of uncertainty, we consider the task of classifying anomalies with the additional option of rejecting predictions of high uncertainty. In addition, we use the accuracy-rejection curve and propose the weighted average accuracy as a performance metric. Our experiments demonstrate the effectiveness of the BAE and total anomaly uncertainty on a set of benchmark datasets and two real datasets for manufacturing: one for condition monitoring, the other for quality inspection.  ( 2 min )
    Equilibrium Aggregation: Encoding Sets via Optimization. (arXiv:2202.12795v1 [cs.LG])
    Processing sets or other unordered, potentially variable-sized inputs in neural networks is usually handled by \emph{aggregating} a number of input tensors into a single representation. While a number of aggregation methods already exist from simple sum pooling to multi-head attention, they are limited in their representational power both from theoretical and empirical perspectives. On the search of a principally more powerful aggregation strategy, we propose an optimization-based method called Equilibrium Aggregation. We show that many existing aggregation methods can be recovered as special cases of Equilibrium Aggregation and that it is provably more efficient in some important cases. Equilibrium Aggregation can be used as a drop-in replacement in many existing architectures and applications. We validate its efficiency on three different tasks: median estimation, class counting, and molecular property prediction. In all experiments, Equilibrium Aggregation achieves higher performance than the other aggregation techniques we test.  ( 2 min )
    Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice. (arXiv:2202.12780v1 [stat.ML])
    One of the main tasks of actuaries and data scientists is to build good predictive models for certain phenomena such as the claim size or the number of claims in insurance. These models ideally exploit given feature information to enhance the accuracy of prediction. This user guide revisits and clarifies statistical techniques to assess the calibration or adequacy of a model on the one hand, and to compare and rank different models on the other hand. In doing so, it emphasises the importance of specifying the prediction target at hand a priori and of choosing the scoring function in model comparison in line with this target. Guidance for the practical choice of the scoring function is provided. Striving to bridge the gap between science and daily practice in application, it focuses mainly on the pedagogical presentation of existing results and of best practice. The results are accompanied and illustrated by two real data case studies on workers' compensation and customer churn.  ( 2 min )
    Biological error correction codes generate fault-tolerant neural networks. (arXiv:2202.12887v1 [cs.LG])
    It has been an open question in deep learning if fault-tolerant computation is possible: can arbitrarily reliable computation be achieved using only unreliable neurons? In the mammalian cortex, analog error correction codes known as grid codes have been observed to protect states against neural spiking noise, but their role in information processing is unclear. Here, we use these biological codes to show that a universal fault-tolerant neural network can be achieved if the faultiness of each neuron lies below a sharp threshold, which we find coincides in order of magnitude with noise observed in biological neurons. The discovery of a sharp phase transition from faulty to fault-tolerant neural computation opens a path towards understanding noisy analog systems in artificial intelligence and neuroscience.  ( 2 min )
    A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits. (arXiv:2108.11345v3 [cs.LG] UPDATED)
    This paper unifies the design and the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem for a class of risk functionals $\rho$ that are continuous and dominant. We prove generalised concentration bounds for these continuous and dominant risk functionals and show that a wide class of popular risk functionals belong to this class. Using our newly developed analytical toolkits, we analyse the algorithm $\rho$-MTS (for multinomial distributions) and prove that they admit asymptotically optimal regret bounds of risk-averse algorithms under the mean-variance, CVaR, and other ubiquitous risk measures. Numerical simulations show that the regret bounds incurred by our algorithms are reasonably tight vis-\`a-vis algorithm-independent lower bounds.  ( 2 min )
    Confidence Calibration for Object Detection and Segmentation. (arXiv:2202.12785v1 [cs.CV])
    Calibrated confidence estimates obtained from neural networks are crucial, particularly for safety-critical applications such as autonomous driving or medical image diagnosis. However, although the task of confidence calibration has been investigated on classification problems, thorough in\-ves\-tiga\-tions on object detection and segmentation problems are still missing. Therefore, we focus on the investigation of confidence calibration for object detection and segmentation models in this chapter. We introduce the concept of multivariate confidence calibration that is an extension of well-known calibration methods to the task of object detection and segmentation. This allows for an extended confidence calibration that is also aware of additional features such as bounding box/pixel position, shape information, etc. Furthermore, we extend the expected calibration error (ECE) to measure mis\-ca\-li\-bra\-tion of object detection and segmentation models. We examine several network architectures on MS COCO as well as on Cityscapes and show that especially object detection as well as instance segmentation models are intrinsically miscalibrated given the introduced definition of calibration. Using our proposed calibration methods, we have been able to improve calibration so that it also has a positive impact on the quality of segmentation masks as well.  ( 2 min )
    Matrix factorisation and the interpretation of geodesic distance. (arXiv:2106.01260v2 [stat.ML] UPDATED)
    Given a graph or similarity matrix, we consider the problem of recovering a notion of true distance between the nodes, and so their true positions. We show that this can be accomplished in two steps: matrix factorisation, followed by nonlinear dimension reduction. This combination is effective because the point cloud obtained in the first step lives close to a manifold in which latent distance is encoded as geodesic distance. Hence, a nonlinear dimension reduction tool, approximating geodesic distance, can recover the latent positions, up to a simple transformation. We give a detailed account of the case where spectral embedding is used, followed by Isomap, and provide encouraging experimental evidence for other combinations of techniques.  ( 2 min )
    Tensor-Train Density Estimation. (arXiv:2108.00089v2 [cs.LG] UPDATED)
    Estimation of probability density function from samples is one of the central problems in statistics and machine learning. Modern neural network-based models can learn high dimensional distributions but have problems with hyperparameter selection and are often prone to instabilities during training and inference. We propose a new efficient tensor train-based model for density estimation (TTDE). Such density parametrization allows exact sampling, calculation of cumulative and marginal density functions, and partition function. It also has very intuitive hyperparameters. We develop an efficient non-adversarial training procedure for TTDE based on the Riemannian optimization. Experimental results demonstrate the competitive performance of the proposed method in density estimation and sampling tasks, while TTDE significantly outperforms competitors in training speed.  ( 2 min )
    Stacked Residuals of Dynamic Layers for Time Series Anomaly Detection. (arXiv:2202.12457v1 [cs.LG])
    We present an end-to-end differentiable neural network architecture to perform anomaly detection in multivariate time series by incorporating a Sequential Probability Ratio Test on the prediction residual. The architecture is a cascade of dynamical systems designed to separate linearly predictable components of the signal such as trends and seasonality, from the non-linear ones. The former are modeled by local Linear Dynamic Layers, and their residual is fed to a generic Temporal Convolutional Network that also aggregates global statistics from different time series as context for the local predictions of each one. The last layer implements the anomaly detector, which exploits the temporal structure of the prediction residuals to detect both isolated point anomalies and set-point changes. It is based on a novel application of the classic CUMSUM algorithm, adapted through the use of a variational approximation of f-divergences. The model automatically adapts to the time scales of the observed signals. It approximates a SARIMA model at the get-go, and auto-tunes to the statistics of the signal and its covariates, without the need for supervision, as more data is observed. The resulting system, which we call STRIC, outperforms both state-of-the-art robust statistical methods and deep neural network architectures on multiple anomaly detection benchmarks.  ( 2 min )
    Sparse Neural Additive Model: Interpretable Deep Learning with Feature Selection via Group Sparsity. (arXiv:2202.12482v1 [stat.ML])
    Interpretable machine learning has demonstrated impressive performance while preserving explainability. In particular, neural additive models (NAM) offer the interpretability to the black-box deep learning and achieve state-of-the-art accuracy among the large family of generalized additive models. In order to empower NAM with feature selection and improve the generalization, we propose the sparse neural additive models (SNAM) that employ the group sparsity regularization (e.g. Group LASSO), where each feature is learned by a sub-network whose trainable parameters are clustered as a group. We study the theoretical properties for SNAM with novel techniques to tackle the non-parametric truth, thus extending from classical sparse linear models such as the LASSO, which only works on the parametric truth. Specifically, we show that SNAM with subgradient and proximal gradient descents provably converges to zero training loss as $t\to\infty$, and that the estimation error of SNAM vanishes asymptotically as $n\to\infty$. We also prove that SNAM, similar to LASSO, can have exact support recovery, i.e. perfect feature selection, with appropriate regularization. Moreover, we show that the SNAM can generalize well and preserve the `identifiability', recovering each feature's effect. We validate our theories via extensive experiments and further testify to the good accuracy and efficiency of SNAM.  ( 2 min )
    Learning Multi-Task Gaussian Process Over Heterogeneous Input Domains. (arXiv:2202.12636v1 [stat.ML])
    Multi-task Gaussian process (MTGP) is a well-known non-parametric Bayesian model for learning correlated tasks effectively by transferring knowledge across tasks. But current MTGP models are usually limited to the multi-task scenario defined in the same input domain, leaving no space for tackling the practical heterogeneous case, i.e., the features of input domains vary over tasks. To this end, this paper presents a novel heterogeneous stochastic variational linear model of coregionalization (HSVLMC) model for simultaneously learning the tasks with varied input domains. Particularly, we develop the stochastic variational framework with a Bayesian calibration method that (i) takes into account the effect of dimensionality reduction raised by domain mapping in order to achieve effective input alignment; and (ii) employs a residual modeling strategy to leverage the inductive bias brought by prior domain mappings for better model inference. Finally, the superiority of the proposed model against existing LMC models has been extensively verified on diverse heterogeneous multi-task cases.  ( 2 min )
    Bidding Agent Design in the LinkedIn Ad Marketplace. (arXiv:2202.12472v1 [cs.GT])
    We establish a general optimization framework for the design of automated bidding agent in dynamic online marketplaces. It optimizes solely for the buyer's interest and is agnostic to the auction mechanism imposed by the seller. As a result, the framework allows, for instance, the joint optimization of a group of ads across multiple platforms each running its own auction format. Bidding strategy derived from this framework automatically guarantees the optimality of budget allocation across ad units and platforms. Common constraints such as budget delivery schedule, return on investments and guaranteed results, directly translates to additional parameters in the bidding formula. We share practical learnings of the deployed bidding system in the LinkedIn ad marketplace based on this framework.  ( 2 min )
    Do autoencoders need a bottleneck for anomaly detection?. (arXiv:2202.12637v1 [cs.LG])
    A common belief in designing deep autoencoders (AEs), a type of unsupervised neural network, is that a bottleneck is required to prevent learning the identity function. Learning the identity function renders the AEs useless for anomaly detection. In this work, we challenge this limiting belief and investigate the value of non-bottlenecked AEs. The bottleneck can be removed in two ways: (1) overparameterising the latent layer, and (2) introducing skip connections. However, limited works have reported on the use of one of the ways. For the first time, we carry out extensive experiments covering various combinations of bottleneck removal schemes, types of AEs and datasets. In addition, we propose the infinitely-wide AEs as an extreme example of non-bottlenecked AEs. Their improvement over the baseline implies learning the identity function is not trivial as previously assumed. Moreover, we find that non-bottlenecked architectures (highest AUROC=0.857) can outperform their bottlenecked counterparts (highest AUROC=0.696) on the popular task of CIFAR (inliers) vs SVHN (anomalies), among other tasks, shedding light on the potential of developing non-bottlenecked AEs for improving anomaly detection.  ( 2 min )
    On Learning and Testing of Counterfactual Fairness through Data Preprocessing. (arXiv:2202.12440v1 [stat.ML])
    Machine learning has become more important in real-life decision-making but people are concerned about the ethical problems it may bring when used improperly. Recent work brings the discussion of machine learning fairness into the causal framework and elaborates on the concept of Counterfactual Fairness. In this paper, we develop the Fair Learning through dAta Preprocessing (FLAP) algorithm to learn counterfactually fair decisions from biased training data and formalize the conditions where different data preprocessing procedures should be used to guarantee counterfactual fairness. We also show that Counterfactual Fairness is equivalent to the conditional independence of the decisions and the sensitive attributes given the processed non-sensitive attributes, which enables us to detect discrimination in the original decision using the processed data. The performance of our algorithm is illustrated using simulated data and real-world applications.  ( 2 min )
    Bayesian Deep Learning for Graphs. (arXiv:2202.12348v1 [cs.LG])
    The adaptive processing of structured data is a long-standing research topic in machine learning that investigates how to automatically learn a mapping from a structured input to outputs of various nature. Recently, there has been an increasing interest in the adaptive processing of graphs, which led to the development of different neural network-based methodologies. In this thesis, we take a different route and develop a Bayesian Deep Learning framework for graph learning. The dissertation begins with a review of the principles over which most of the methods in the field are built, followed by a study on graph classification reproducibility issues. We then proceed to bridge the basic ideas of deep learning for graphs with the Bayesian world, by building our deep architectures in an incremental fashion. This framework allows us to consider graphs with discrete and continuous edge features, producing unsupervised embeddings rich enough to reach the state of the art on several classification tasks. Our approach is also amenable to a Bayesian nonparametric extension that automatizes the choice of almost all model's hyper-parameters. Two real-world applications demonstrate the efficacy of deep learning for graphs. The first concerns the prediction of information-theoretic quantities for molecular simulations with supervised neural models. After that, we exploit our Bayesian models to solve a malware-classification task while being robust to intra-procedural code obfuscation techniques. We conclude the dissertation with an attempt to blend the best of the neural and Bayesian worlds together. The resulting hybrid model is able to predict multimodal distributions conditioned on input graphs, with the consequent ability to model stochasticity and uncertainty better than most works. Overall, we aim to provide a Bayesian perspective into the articulated research field of deep learning for graphs.  ( 2 min )
    Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance. (arXiv:2202.12387v1 [cs.LG])
    In this paper, we study contrastive learning from an optimization perspective, aiming to analyze and address a fundamental issue of existing contrastive learning methods that either rely on a large batch size or a large dictionary. We consider a global objective for contrastive learning, which contrasts each positive pair with all negative pairs for an anchor point. From the optimization perspective, we explain why existing methods such as SimCLR requires a large batch size in order to achieve a satisfactory result. In order to remove such requirement, we propose a memory-efficient Stochastic Optimization algorithm for solving the Global objective of Contrastive Learning of Representations, named SogCLR. We show that its optimization error is negligible under a reasonable condition after a sufficient number of iterations or is diminishing for a slightly different global contrastive objective. Empirically, we demonstrate that on ImageNet with a batch size 256, SogCLR achieves a performance of 69.4% for top-1 linear evaluation accuracy using ResNet-50, which is on par with SimCLR (69.3%) with a large batch size 8,192. We also attempt to show that the proposed optimization technique is generic and can be applied to solving other contrastive losses, e.g., two-way contrastive losses for bimodal contrastive learning.  ( 2 min )
    Estimators of Entropy and Information via Inference in Probabilistic Models. (arXiv:2202.12363v1 [stat.ML])
    Estimating information-theoretic quantities such as entropy and mutual information is central to many problems in statistics and machine learning, but challenging in high dimensions. This paper presents estimators of entropy via inference (EEVI), which deliver upper and lower bounds on many information quantities for arbitrary variables in a probabilistic generative model. These estimators use importance sampling with proposal distribution families that include amortized variational inference and sequential Monte Carlo, which can be tailored to the target model and used to squeeze true information values with high accuracy. We present several theoretical properties of EEVI and demonstrate scalability and efficacy on two problems from the medical domain: (i) in an expert system for diagnosing liver disorders, we rank medical tests according to how informative they are about latent diseases, given a pattern of observed symptoms and patient attributes; and (ii) in a differential equation model of carbohydrate metabolism, we find optimal times to take blood glucose measurements that maximize information about a diabetic patient's insulin sensitivity, given their meal and medication schedule.  ( 2 min )
    Learning Invariant Weights in Neural Networks. (arXiv:2202.12439v1 [stat.ML])
    Assumptions about invariances or symmetries in data can significantly increase the predictive power of statistical models. Many commonly used models in machine learning are constraint to respect certain symmetries in the data, such as translation equivariance in convolutional neural networks, and incorporation of new symmetry types is actively being studied. Yet, efforts to learn such invariances from the data itself remains an open research problem. It has been shown that marginal likelihood offers a principled way to learn invariances in Gaussian Processes. We propose a weight-space equivalent to this approach, by minimizing a lower bound on the marginal likelihood to learn invariances in neural networks resulting in naturally higher performing models.  ( 2 min )

  • Open

    [D] averaging convolutions of different kernel sizes
    Let’s say you are deal with a signal and you want to convince over the time dimension. Does it make sense to convolve let’s say with a few different kernel sizes and then average the output as the input feature to your next layer? Better yet is anyone aware of any references where people have done this averaging approach at different scales? submitted by /u/AbjectDrink3276 [link] [comments]  ( 1 min )
    [D] Scores in ACL rolling review
    What does a score of 2.5 given by the reviewers in ACL rolling review represent? I heard that if I want to respond to the reviews then I have to make a resubmission. How fast will they respond to the resubmissions? What's the difference between a "review" and a "meta-review"? ​ How bad is 2.5/2.5/2 for NAACL or even COLING? submitted by /u/SiegeMemeLord [link] [comments]  ( 1 min )
    [D][R] Has this type of Memory-Augmented Neural Network architecture ever been tried before?
    I would like to know if the following architecture has already been studied. The purpose is to make a neural network use an external database, but with a twist: A CNN (or RNN, but CNN is simpler) using as input "I" a list of N concatenated pictures, as output "O" a single picture, representing whatever prediction you need to do, and a float "L", representing the predicted loss error of "O". In other words, the network will attempt to predict its own error amount. The idea is, after every forward pass, you do a backward pass but with the weights of the network frozen and you allow the input tensor to get modified in order to maximize "L", ignoring entirely "O". This happens both during training and during evaluation. Then, lookup in the database the closest matches to the newly modified N input pictures. This can probably be done very fast with Locality-sensitive hashing or some other indexing system. The last step is the training of the actual network weights. Use the newly found pictures as inputs and predict "O" and "L", which, again, is the loss on the whole "O". This way, the "L" value, being oblivious to the gimmick, will always try its best to be accurate and won't skyrocket nor dive to zero. Same thing for "O". I think "L" also represent a measure of how confident the network is, which might be occasionally useful. submitted by /u/OuchieOnChin [link] [comments]  ( 1 min )
    [D] Resources to learn Deep Learning theory
    I want to improve my understanding of Deep Learning theory in areas like why does Gradient Descent work, interpolation vs. generalization, loss landscapes and many more. What are resources (books, papers, blog posts, etc.) that you used to get a better understanding of the theory behind Deep Learning? submitted by /u/optimized-adam [link] [comments]  ( 1 min )
    [R] The second (advanced) volume of Murphy's "Probabilistic Machine Learning" is now available in draft form!
    For anyone curious. Announcement on Twitter: https://twitter.com/sirbayes/status/1498402522511253510 Download link: https://probml.github.io/pml-book/book2.html submitted by /u/bikeskata [link] [comments]
    [P] Why does regularizing the bias lead to underfitting in neural networks?
    Hi everyone! I am a PhD student at the Indian Institute of Technology Kharagpur (IIT Kgp), working in the domain of understanding neural networks from a theoretical perspective. I am really passionate about DL and ML and wanted to give my own spin on many of the things I have learned while working through my Masters' and now PhD. I also get a lot of new ideas that aren't flashy enough to be put down in the form of a research paper but (I feel) would somewhat help the ML community, in general. Hence, as a creative outlet, my friend (a researcher at Amazon India) and I created this website, https://www.deepwizai.com/ deepwizai.com is essentially a collection of our best ideas in the form of blogs where we explain different concepts in both DL and ML. We also publish our own new algorithms and explain them right from the stage of conceptualization to final evaluation and experimentation. In the following post: Why does regularizing the bias lead to underfitting in neural networks? — deepwizAI we provide a simple explanation, backed with code and visuals, as to why taking the bias term in regularization leads to underfitting. Do leave a review about the post and the website in general! Thank you so much for reading through all this and have a great day! ​ This GIF has been sourced from the author's website submitted by /u/atif_hassan [link] [comments]  ( 1 min )
    [P] Survey for data annotation business plan school project
    Hello, everyone! My classmates and I have to create a business plan as part of our entrepreneurship class and we are working on something in the data annotation market. Right now, we need some information about the needs of people who work with machine learning and therefore need these services. If you have ever worked on a project where data annotation was needed, I kindly ask you to answer the very small survey linked below. It shouldn't take more than 3 minutes of your time. https://forms.gle/S6HXBsW8Cc9f8orp8 Thank you! submitted by /u/gamarrit [link] [comments]  ( 1 min )
    [R] How does a spatial basis applied to images work?
    I am trying to understand what the processings in this function imply. If you take this function and apply it to an RGB image, what exactly is going on? Thanks! def get_spatial_basis(h, w, d): """Gets a sinusoidal position encoding for image attention.""" half_d = d // 2 basis = np.zeros((h, w, d), dtype=np.float32) div = np.exp( np.arange(0, half_d, 2, dtype=np.float32) * -np.log(100.0) / half_d) h_grid = np.expand_dims(np.arange(0, h, dtype=np.float32), 1) w_grid = np.expand_dims(np.arange(0, w, dtype=np.float32), 1) basis[:, :, 0:half_d:2] = np.sin(h_grid * div)[:, np.newaxis, :] basis[:, :, 1:half_d:2] = np.cos(h_grid * div)[:, np.newaxis, :] basis[:, :, half_d::2] = np.sin(w_grid * div)[np.newaxis, :, :] basis[:, :, half_d + 1::2] = np.cos(w_grid * div)[np.newaxis, :, :] return basis submitted by /u/No_Possibility_7588 [link] [comments]  ( 1 min )
    [D] Sentiment Analysis, Part 3— Data annotation
    Good morning/afternoon! ☀️ I am back with the 3rd blog post of the series about Sentiment Analysis. This one is about Data Annotation, which is a concept that doesn't only apply to Sentiment Analysis but to any projects in which you need annotated data. If you don't know what is Data Annotation or if you want to know more about it, I suggest you take a look at this blog post: https://medium.com/besedo-engineering/sentiment-analysis-part-3-data-annotation-d0845d1a8a1b 📖 I hope this blog post will be useful to you and, as always, any feedback is welcome 🙌 submitted by /u/jade_mlc [link] [comments]  ( 1 min )
    [N] New Transport Planning Dataset for Deep Graph Neural Networks
    Hello everybody! I just published a graph dataset from strategic transport planning aimed at creating deep GNNs: https://github.com/nikita68/TransportPlanningDataset ​ Its based on augmented & synthetic data with the targets based on simulations by domain experts. The aim is to create a surrogate to overcome a number of previous drawbacks. From the GNN perspective, it really highlights some shortcomings, such as the need for dynamic depth GNNs as well as to overcome the bottleneck problem of depth in graphs. If you have any questions or feedback, please let me know! submitted by /u/ggNikita [link] [comments]  ( 1 min )
    [D] How much pooling should you do in a CNN?
    Typically the last layers of a CNN involve flattening the image into a single vector and passing that through one or more Dense layers. My question is, how long should that flattened image vector be? If you used no pooling layers at all, the flattened vector will be of length h*w*c, where h and w are the height and width of your image and c is the number of channels. Usually this is too large to be practical to feed into a Dense layer, for example for 128x128 RGB images the flattened length is ~50000. I believe this is why we use pooling layers to reduce dimensionality (adding pooling layers or increasing size of the pooling window both reduce dimensionality). Is there a good rule of thumb for how much pooling is absolutely necessary before flattening? If the flattened vector is 10000, is that too big? What about 1000? What is the best way to make this decision? submitted by /u/woodenlathe [link] [comments]  ( 1 min )
    What is the difference between BERT available on tenserflow hub versus BERT available on huggingface hub [D]
    I’m working in the field of Natural Language Processing. Recently I had to fine tune BERT, which is a pretrained model for generating contextualised sentence embeddings. While going through some tutorials, I came across BERT being fine tuned in two ways, one of which is using the model from tensorflow hub and the other from huggingface hub. What is the difference between the two ? submitted by /u/Literary_Lava [link] [comments]  ( 1 min )
    [D] How can I interpret Shapley values if I don't have a background in game theory?
    One limitation of the Shapley value pointed out in the following post/paper was that for non-experts in game theory, theoretical interpretations are not intuitive and not useful. https://www.reddit.com/r/MachineLearning/comments/t10cuy/r_the_shapley_value_in_machine_learning/ My goal is to understand what input features are more important to a neural network's predictions (LSTM in my case). The SHAP Python library makes it easy to calculate SHAP values. Since my inputs are time series, I get a SHAP value for every time step of every input feature. I can then see, using the absolute of the SHAP value, what time steps of what input features are contributing more to the model’s predictions, which is exactly what I want. Now, I want to identify certain time-steps and/or input features as being most important for predictions on my test dataset. Is it reasonable to set a threshold for the absolute of the SHAP value, say at 50% of its range, and claim that the time-steps and/or features above this threshold were most important for the predictions? Is there a better way to accomplish this? submitted by /u/baigyaanik [link] [comments]  ( 2 min )
    [R] New computer vision challenges on assistive tech for blind/low-vision, with prizes of 5K USD cash + 30K USD Azure Credits
    We are pleased to announce the fourth annual VizWiz Grand Challenge workshop, which will be held at CVPR 2022. The workshop is dedicated to developing computer vision algorithms to assist people who are blind and low-vision. This year, the workshop introduces 3 challenges (with prizes for top teams!): Visual Question Answering (VQA) Challenge, which focusses on answering a question asked by a person who is blind/low vision about an image they've taken on their mobile phone. VQA Grounding Challenge, which focusses on locating where the model is "looking" in the image when it answers the user's question. Few-Shot Object Recognition Challenge, which focusses on learning a personalised object recogniser for a person who is blind/low-vision by using just a few video clips they've captured of their personal objects on their mobile phone. Prizes - $30,000 in Microsoft Azure credit + $5,000 in cash will be awarded to top-performing teams across the three Challenges. Please see individual Challenge pages for more details. Deadlines Monday, February 7. Submissions open. Friday, May 6 [9:00am Central Standard Time]. Submissions close. Monday, June 20: all-day workshop at CVPR. submitted by /u/eee-vaaah [link] [comments]  ( 1 min )
    [N] TorchStudio, a free open source IDE for PyTorch
    Hi, after months of closed beta I'm launching today a free, open source IDE for PyTorch called TorchStudio. It aims to greatly simplify researches and trainings with PyTorch and its ecosystem, so that most tasks can be done visually in a couple clicks. Hope you'll like it, I'm looking forward to feedback and suggestions :) -> https://torchstudio.ai submitted by /u/divideconcept [link] [comments]  ( 3 min )
    [P] Gesture recognition system: TinyML on 8-bit microcontroller
    I built a gesture recognition system based on an accelerometer (for simplicity, only 2 motions: a binary classification problem) on Arduino Mega2560 (based on 8-bit microcontroller but with large memory space compared to other boards of Arduino family) using the Neuton TinyML framework. The article that describes this experiment in detail is published on Toward Data Science: https://towardsdatascience.com/ultra-tinyml-machine-learning-for-8-bit-microcontroller-9ec8f7c8dd12 Your feedback and questions are welcome! submitted by /u/Leo_Cav [link] [comments]  ( 1 min )
    [D] Thoughts on Lex Fridman Podcast episode with Mark Zuckerberg
    I felt this was the summary of the whole conversation: Lex asks a difficult question to Mark, Mark says that yeah that is something that happened with Y too when we didn't even have internet, and it is bad but everything has its strengths and weaknesses. Followed by "we've teams who are doing research on this, and according to research X it is actually the opposite of what the allegations say". Finally, "no other company in this space does this, and they should". It's like he memorized a template before coming to the podcast. I mean that's a very nice way to cover up :) What are your thoughts? submitted by /u/frankhart98 [link] [comments]  ( 4 min )
    [R] Robotic Telekinesis: Controlling Multifingered Robotic Hand by Watching Humans on Youtube (link in comments)
    submitted by /u/pathak22 [link] [comments]  ( 2 min )
    [P] You Only Plot Once (YOPO) -> Simple low code visualization library
    Hello All, Have you been in a scenario where you want to make simple plots just understand the datasets? But setting up and writing code for all the plots take lots of time googling and coding? ​ I have been facing this issue for a ling time. So I have developed a very low code dashboard which generates many simple plots plots. The tool is developed with Plotly Dash and some CSS tricks. With this library you can easily make Scatter plot, Bar graph, Histogram with one line of code. So, YOPO is an ~0 Code interactive dashboard which generates various standard plots. You can create various graphs and charts with a click of a button. ​ https://preview.redd.it/34fhzi2v5ik81.png?width=1280&format=png&auto=webp&s=48c6c3785fb3aa8624b5af65dd0ad3f90a22b6de Please try the tool and let me know in anyway can I improve the tool. Github Source Code : https://github.com/chekoduadarsh/YOPO-You-Only-Plot-Once Pypi: https://pypi.org/project/yopo/ Colab Demo: https://colab.research.google.com/github/chekoduadarsh/YOPO-You-Only-Plot-Once/blob/master/example.ipynb#scrollTo=YfA3eYgyz7gv Web Demo: https://csvvis.herokuapp.com/ If you like it please star it (This will help me motivate it to make it more amazing!!!) submitted by /u/Mrshadow143 [link] [comments]  ( 1 min )
    [D] What is your setup for setting up and monitoring experiments in the cloud?
    I often use Google Cloud VMs for my deep learning experiments, since they have V100s I can use. My typical setup is that I copy the files I need, and run nohup python3 train.py & and then exit, periodically checking for progress. Is there a better way? How do you set up your cloud environments? submitted by /u/FlyingQuokka [link] [comments]  ( 1 min )
    [P] A Python framework for evaluating AI/ML Wordle agents
    I put together a Python package to help write and evaluate Wordle bots. Let me know what you think! Feedback welcome. https://peterbbryan.medium.com/introducing-wordle-benchmark-423f4f6d38a3 https://github.com/peterbbryan/wordle-benchmark https://pypi.org/project/wordle-benchmark/ https://i.redd.it/cgjp5x0nohk81.gif submitted by /u/pbbryan [link] [comments]  ( 1 min )
    [D] What is state of the art in auto encoders for image generation ?
    I'd like to generate images on low resources machine, with a very compressed low dimensional latent space. Do you have any architectures in mind other than traditional Variational Auto Encoders ? submitted by /u/Silver_Doughnut_8175 [link] [comments]  ( 1 min )
  • Open

    Federated Learning with Formal Differential Privacy Guarantees
    Posted by Brendan McMahan and Abhradeep Thakurta, Research Scientists, Google Research In 2017, Google introduced federated learning (FL), an approach that enables mobile devices to collaboratively train machine learning (ML) models while keeping the raw training data on each user's device, decoupling the ability to do ML from the need to store the data in the cloud. Since its introduction, Google has continued to actively engage in FL research and deployed FL to power many features in Gboard, including next word prediction, emoji suggestion and out-of-vocabulary word discovery. Federated learning is improving the “Hey Google” detection models in Assistant, suggesting replies in Google Messages, predicting text selections, and more. While FL allows ML without raw data collection, differe…  ( 8 min )
    Constrained Reweighting for Training Deep Neural Nets with Noisy Labels
    Posted by Abhishek Kumar and Ehsan Amid, Research Scientists, Google Research, Brain Team Over the past several years, deep neural networks (DNNs) have been quite successful in driving impressive performance gains in several real-world applications, from image recognition to genomics. However, modern DNNs often have far more trainable model parameters than the number of training examples and the resulting overparameterized networks can easily overfit to noisy or corrupted labels (i.e., examples that are assigned a wrong class label). As a consequence, training with noisy labels often leads to degradation in accuracy of the trained model on clean test data. Unfortunately, noisy labels can appear in several real-world scenarios due to multiple factors, such as errors and inconsistencies in …  ( 9 min )
  • Open

    Training an agent with Curriculum learning
    I am training a tank agent to navigate the map, pick up different kinds of power ups and shoot targets. My biggest challenge to date is that I started training the agent treating the aforementioned behaviors like separate behaviors/separate brains. Before starting again with all my knowledge gained, I want to gather more tips and info on how I could make this easier on myself and on my (pretty old) latop. https://pastebin.com/ZE7iQ8yq Here is also a pastebin with the documentation of my progress up to this point. Any help is appreciated submitted by /u/JamaicanXanax [link] [comments]  ( 1 min )
    DDPG outputs are saturated
    Hello there ! I need your help ! I am working with DDPG with 2 continuous actions and state size of 15. The model must output actions that range from 0 to 1 so after the output of a Tanh Layer I modified the action as actions = (actions + 1)/2. I tried many values for the hyperprameters. However, after few episodes the agent output [1,0] or [0,1]. one thing I´m not sure about it is the update frequency because I have a large dataset, the number of steps in each episode is 15 000 steps. And I set update each 100 steps https://preview.redd.it/2d2887xeolk81.png?width=343&format=png&auto=webp&s=0b88820a235a9122c94f4e84dc04caeb16b90795 The actor and critic : class Actor(nn.Module): super(Actor, self).__init__() self.seed = torch.manual_seed(seed) self.bn_input = nn.BatchNorm1d(state_si…  ( 1 min )
    Help installing Garage[rlworkgroup/garage] on Jupyter/Colab!
    I'm currently following the example inside jupyter folder but I get error while installing mujoco-py. submitted by /u/Big-Picture8323 [link] [comments]  ( 1 min )
    Ensuring a diversity of actions?
    Obviously there's tonnes of ways to balance exploration and exploitation to (hopefully) explore enough of the correct portions of the action space to converge to an optimal policy. However, in environments where there are many optimal actions, is there any any way to ensure that the optimal policy explores each of them - instead of focusing on one of the actions? submitted by /u/99StewartL [link] [comments]  ( 1 min )
    Dual Actor Critic : How to design policy in AC methods with dual actors and a single critic . Do some papers employ this technique ?
    submitted by /u/aabra__ka__daabra [link] [comments]  ( 1 min )
    Whitepaper on public policy for reinforcement learning
    See here: https://cltc.berkeley.edu/reward-reports/ The authors propose a new kind of AI documentation, Reward Reports, that track what an RL system is optimizing for over time as it learns the dynamics of a domain and (as the case may be) actively reshapes them to conform to its specification. submitted by /u/silvdraggoj [link] [comments]  ( 1 min )
  • Open

    Text to Audio AI
    submitted by /u/Active-Heron-8898 [link] [comments]
    Audio to Text AI
    submitted by /u/Active-Heron-8898 [link] [comments]
    Text to Video AI
    submitted by /u/Active-Heron-8898 [link] [comments]
    Video to Text AI
    submitted by /u/Active-Heron-8898 [link] [comments]
    Image to Text AI
    submitted by /u/Active-Heron-8898 [link] [comments]
    Future of A.I. in Neurosurgery
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    What is neural architecture search?
    submitted by /u/bendee983 [link] [comments]
    Digital Antinatalism: Is It Wrong to Bring Sentient AI Into Existence?
    submitted by /u/iamtheoctopus123 [link] [comments]  ( 2 min )
    What is TorchRec?
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    What are AIs that are able to render funny stuff?
    such as the AI that is rendering feed pics. submitted by /u/xXNOdrugsForMEXx [link] [comments]
    Do you need Math for Machine Learning? If YES, what part of math do you need? The link to free resources are mentioned in the description of the video.
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
    Tiny ML Air Writing Recognition with Nicla Sense ME
    submitted by /u/literallair [link] [comments]
    TinyML Monitoring Air Quality an 8-bit Microcontroller
    I’d like to share my experiment on how to easily create your own tiny machine learning model and run inferences on a microcontroller to detect the concentration of various gases. I will illustrate the whole process with my example of detecting the concentration of benzene (С6H6(GT)) based on the concentration of other recorded compounds. Things I used in this project: Arduino Mega 2560, Neuton Tiny ML software To my mind, such simple solutions may contribute to improving the air pollution problem which now causes serious concerns. In fact, the World Health Organization estimates that over seven million people die prematurely each year from diseases caused by air pollution. Can you imagine that? As such, more and more organizations, responsible for monitoring emissions, need to have effe…  ( 3 min )
    Hey Folks, In the latest blog GitHub explains how its Code Scanning Now Uses Machine Learning (ML) to Alert Developers To Potential Security Vulnerabilities in Their Code
    Every piece of software and code contains flaws. While some of these flaws are minor and simply impair an application’s functioning, others have the potential to compromise its security. It is important to find and remedy these security flaws for application security. Code scanning is one such framework that now uses machine learning for detecting potential security flaws in software that identifies vulnerabilities and corrects them before they are released into production, reducing the security risks they offer. Continue reading our summary or you can also read Github blog https://reddit.com/link/t37lvz/video/wsiqmbaxbik81/player submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    Connecting powers of two and decibels
    Colin Wright pointed out a pattern in my previous post that I hadn’t seen before. He wrote about it here a couple years ago. Start with the powers of 2 from the top of the post: 21 = 2 22 = 4 23 = 8 24 = 16 25 = 32 26 = 64 27 […] Connecting powers of two and decibels first appeared on John D. Cook.  ( 2 min )
  • Open

    Abundance Mentality is Key to Exploiting the Economics of Data
    Why do data silos, and now analytic silos, continue to exist?  It can’t be due to technical issues.  Data silos appeared in the 1990s when we were trying to make Relational Data Base Management Systems (RDMBS) – that were architected for single-record transaction processing – perform massive table scans to identify the trends, patterns, and… Read More »Abundance Mentality is Key to Exploiting the Economics of Data The post Abundance Mentality is Key to Exploiting the Economics of Data appeared first on Data Science Central.  ( 7 min )
    Application Integration vs Data Integration: A Comparison
    Application integration and data integration are approaches taken by organizations to utilize data from different systems, but they meet different needs. They are the two of the most talked-about concepts in the world of IT and data, essentially involving data management. Yet, the concepts of application integration and data integration are deeply different, especially when… Read More »Application Integration vs Data Integration: A Comparison The post Application Integration vs Data Integration: A Comparison appeared first on Data Science Central.  ( 3 min )
    Why We Can’t Trust AI to Run The Metaverse
    New study highlights key metrics for measuring trust. Web 2.0 trust issues will bleed over to the metaverse. One solution is the change how we measure trustworthiness. Artificial Intelligence is at the core of the Metaverse, but how trustworthy is it? Although there have been many previous studies into AI’s trustworthiness in Web 2.0, these… Read More »Why We Can’t Trust AI to Run The Metaverse The post Why We Can’t Trust AI to Run The Metaverse appeared first on Data Science Central.  ( 4 min )
    Could an explainable model be inherently less secure?
    In  a week when privacy is very much on the agenda, we ask how can we protect AI models. This is already a mature field but still not on the radar of most developers. Recently, ENISA published a report called securing machine learning applications (link below), which gave a good summary of the key threats… Read More »Could an explainable model be inherently less secure? The post Could an explainable model be inherently less secure? appeared first on Data Science Central.  ( 5 min )
  • Open

    Eastern European Guide to Writing Reference Letters
    Excruciating. One phrase I often use to describe what it's like to read reference letters for Eastern European applicants to PhD and Master's programs in Cambridge. Even objectively outstanding students often receive dull, short, factual, almost negative-sounding reference letters. This is a result of (A) cultural  ( 6 min )
  • Open

    Getting back into studying, to finish off a dissertation. Should I buy this course?
    I have a dissertation (for a Master's programme) to complete in a few months. I've had to defer it for a few years because of some personal issues (ie. resitting a module, babies, pandemic), so getting back into the grind of things isn't easy. From my last attempt at doing this, I'm fairly certain that I'll be basing it on neural networks somehow. Maybe classifying some medical image database? So, I found this course, apparently it (a) walks you through the theory with exercises in actually coding, and (b) the project at the end is a medical imaging project, which looks similar to what I was thinking of doing (I was thinking of doing something with diabetic retinopathy after a suggestion. Going with it because it seems like a nice, clear cut problem suitable for a Master's project). Machine Learning, Data Science and Deep Learning with Python https://www.udemy.com/course/data-science-and-machine-learning-with-python-hands-on It's for £16. What do you guys thing? Shall I go for it? (and, if I could cheekily add, any discount codes? ;) ) Thank you :o) submitted by /u/nadapol [link] [comments]  ( 2 min )
    I’m working on building emotion classification from scratch
    Our problem statements tells us to use 2 hidden layers of 100 neurons each and classify it to one of the 7 labels which are one hot encoded. My training data X has around 1000+ images and each image is 48x48 matrix. Is it possible to process each image with two hidden layers since I’m unable to find any combination to get 7x1 matrix or should I just use another hidden layer to get the desired output submitted by /u/Actual-Performer-832 [link] [comments]  ( 1 min )

  • Open

    Courses needed to become proficient in reinforcement learning
    Hi all, I'm new to reinforcement learning. I'd like to learn this type of machine learning training method fundamentally; that is, I want to initially be able to use "pen and paper" to work through simple reinforcement problems before using programming languages. My goal is to find and structure online courses (ideally open courses) that will teach me the material and concepts that I need. My big current issue right is that I don't know what pre-requisites I'm not proficient in. Pretend that I'm an undergraduate student beginning my freshman year. Any help is appreciated. submitted by /u/Glad-Amphibian-8636 [link] [comments]  ( 1 min )
    Teaching an AI How to KILL
    submitted by /u/Stochastic_Machine [link] [comments]
    How to train multi agents with PPO/DQN for playing Atari Game
    Hi, I would like to train two agents with either PPO or DQN, so that they are able to play a competitive Atari game like Pong or Slimevolleyball. I wanted to train them using Stable Baselines, but I could not find an example for training two agents. Then I found RLlib but this lib does not provide an example for the two games named before and I struggled with configuring a new environment. Is there any way I can train two agents with PPO or DQN in the gym environment using one of these Libraries? Thank you! submitted by /u/theresa_k96 [link] [comments]  ( 1 min )
    question about the recurrent
    I have a question about the recurrent. Under the recurrent setting, there is not the computation graph betweent step t and step t-1 , how to backprop gradient? https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/storage.py#L145 submitted by /u/Traditional-Ad4492 [link] [comments]  ( 1 min )
    Reference about training an agent on multiple environment
    Hello, I am searching good references on the practice of training one agent on many different tasks at the same time, does someone have recommendations ? :) Thanks! submitted by /u/krenast [link] [comments]  ( 1 min )
    Robbins-Monro approximation for expectation
    submitted by /u/promach [link] [comments]  ( 1 min )
  • Open

    How to animate still images - FILM: Frame Interpolation for Large Motion, a 5-minute paper summary by Casual GAN Papers
    Motion interpolation between two images surely sounds like an exciting task to solve. Not in the least, because it has many real-world applications, from framerate upsampling in TVs and gaming to image animations derived from near-duplicate images from the user’s gallery. Specifically, Fitsum Reda and the team at Google Research propose a model that can handle large scene motion, which is a common point of failure for existing methods. Additionally, existing methods often rely on multiple networks for depth or motion estimations, for which the training data is often hard to come by. FILM, on the contrary, learns directly from frames with a single multi-scale model. Last but not least, the results produced by FILM look quite a bit sharper and more visually appealing than those of existing alternatives. As for the details, let’s dive in, shall we? Full summary: https://t.me/casual_gan/268 Blog post: https://www.casualganpapers.com/motion-interpolation-image-animation-implicit-model/FILM-explained.html FILM arxiv / code (unofficial) Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries!h submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    Hey Folks, Here is a really cool research by Deepmind researchers where they probe image-language transformers and propose SVO probes for verb understanding
    Fine-tuning is required for a range of functions performed by multimodal image–language transformers. Researchers worldwide are interested in whether these algorithms can detect verbs or merely use the nouns in a phrase. A dataset of image-sentence pairs with 447 verbs that are either visible or widely encountered in the pretraining data was compiled to perform this task. Researchers from DeepMind propose to evaluate the pretrained models in a zero-shot manner using this dataset. In comparison to other elements of speech, it is found that the pretrained models underperform more in scenarios that demand verb interpretation. An investigation into which verbs are the most difficult for these models is underway. You can read this full summary here But if you are interested in only paper then you check the paper here https://preview.redd.it/wvg7cmlxyek81.png?width=1548&format=png&auto=webp&s=00a33a8d30b1285780d13fce2d271586852d7736 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Like Futurama but...
    submitted by /u/AutoMeta [link] [comments]
    Pytorch Introduces ‘TorchRec’: A Python-based PyTorch Domain Library For Recommendation Systems (RecSys)
    Recommendation Systems (RecSys) are a big part of today’s production-ready AI, although you wouldn’t know it from glancing at Github. In contrast to domains like Vision and NLP, most of RecSys’ continuous discovery and development takes place behind closed doors. The field is far from democratized for academic researchers exploring these approaches or creating individualized user experiences. RecSys as a field is also defined by learning models over sparse and/or sequential events, which has a lot of overlap with other AI fields. Many of the approaches, particularly those for scalability and distributed execution, are portable. RecSys approaches account for a significant amount of worldwide AI investment; therefore, enclosing them prevents this money from moving into the larger AI field. TorchRec is a new PyTorch domain library for Recommendation Systems. This library includes standard sparsity and parallelism primitives, allowing researchers to create and implement cutting-edge customization models. CONTINUE READING Github: https://github.com/pytorch/torchrec https://preview.redd.it/u5mu339xkak81.png?width=1536&format=png&auto=webp&s=3f700fc09c9e3491827b9f4509f073a56bf84336 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    [D] How to animate still images - FILM: Frame Interpolation for Large Motion, a 5-minute paper summary by Casual GAN Papers
    Motion interpolation between two images surely sounds like an exciting task to solve. Not in the least, because it has many real-world applications, from framerate upsampling in TVs and gaming to image animations derived from near-duplicate images from the user’s gallery. Specifically, Fitsum Reda and the team at Google Research propose a model that can handle large scene motion, which is a common point of failure for existing methods. Additionally, existing methods often rely on multiple networks for depth or motion estimations, for which the training data is often hard to come by. FILM, on the contrary, learns directly from frames with a single multi-scale model. Last but not least, the results produced by FILM look quite a bit sharper and more visually appealing than those of existing alternatives. As for the details, let’s dive in, shall we? Full summary: https://t.me/casual_gan/268 Blog post: https://www.casualganpapers.com/motion-interpolation-image-animation-implicit-model/FILM-explained.html FILM arxiv / code (unofficial) Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    [R][P] Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut + Hugging Face Spaces Gradio Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Beef up the workstation or move to the cloud?
    Hi there, I am new to the ML area and would like your feedback. I am at a turning point, either beefing up my workstation or using cloud providers like AWS, GCP, Azure or smaller providers like paperspace. Looking for advice on which way to go. If you are using the cloud providers what pitfalls to be aware of or other useful pointers. So far I have more seriously looked into the AWS sagemaker and it's a “bit of a learning curve” to get started with. The google colab and sagemaker studio lab yet again are too limited for my needs. Thanks submitted by /u/Academic_Arrak [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 132
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks : 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-110 111-120 121-130 131-140 Week 1 Week 11 Week 21 Week 31 Week 41 Week 51 Week 61 Week 71 Week 81 Week 91 Week 101 Week 111 Week 121 Week 131 Week 2 Week 1…  ( 1 min )
    [D] Terminology discussion: Does "model" refer to the raw or to the fully trained mathematical method?
    When looking into Machine Learning models, I came across something that confuses me: On the one hand "model" is referred to (e.g. by Google) as something you pick before you start the training process, meaning the raw mathematical method (e.g. linear regression). On the other hand "model" is referred to (e.g. by Google) as the fully trained mathematical method, so the mapping from input to output that you get after training (e.g. the result of conducting a linear regression on the training data). Is both correct? If not, how would you call the remaining thing from the two options above? submitted by /u/karlssonvomdach [link] [comments]  ( 1 min )
    [D] ML in the real-world : How did you solved a business problem using ML?
    Solving a business problem can be tricky at first glance, no one ever comes to you saying "I need a classifier to classify this and that" or "I'm want a recommendation system to help me do that". Have you any problem you have translated into ML-terms and solved it? How was the outcome? What data did you used? In which context your project was set in if any? What did you use to solve the problem (tool, algos...)? submitted by /u/charlesaten [link] [comments]  ( 1 min )
    [D] How satisfied are you with the current Interpretability methods for Neural Networks?
    In the past few years, papers on interpretability for neural networks have been an important part of top-tier ML/CV conferences. I would like to know how satisfied is the ML community with this progress. View Poll submitted by /u/aritipandu_san [link] [comments]  ( 1 min )
    [P] I made the kind of tutorial I wish someone had made for me when I first started trying to connect the math in research papers with code examples I found online
    submitted by /u/clam004 [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 2 min )
    Constant validation metrics [D]
    I am training a neural network. The learning curves for the training metrics change over time, but the validation metrics are always constant. Originally I thought it might be over fitting but I changed the learning rate and it is still constant. Any advice for fixing this issue would be appreciated. submitted by /u/Dog-eating-chicken [link] [comments]  ( 1 min )
    [D] evolutionary algorithms roadmap
    after watching this video i want to dive more deep into the topic and create such algorithms. questions: is there enough research on this topic so i could do on higher education? are these methods being used in AI/ML? how much is this methods explored? is it just taking off and not much research is done or they found equivalent methods in order to simulate such procedures and get the outcome? what things should i learn in order to get my hands into this topic? i took linear algebra and numerical analysis this semester and I'm trying to go toward more advanced topics (to be honest i just studied in past 2 years, had no job and didn't choose a topic to dive deep and specialize myself (some people do web and i hate development, also i like academia route) submitted by /u/sweetchocolotepie [link] [comments]  ( 1 min )
    [D] Favorite paper integrating Causality and Machine Learning
    Hi, For Causality-inspired machine learning and Machine-learning inspired causality research, I think it is difficult to understand what the state-of-the-art is for subproblems in these domains given the lack of transparent and unambiguously good benchmarks. I also haven't seen best paper awards for work in the above areas at least not in a major machine learning conference. I am curious for those familiar with this work, what are your favorite papers from one of or both areas even if it focuses on a relatively applied problem (i.e problems in economics). It would also be helpful if you could provide some rationale. submitted by /u/HybridRxN [link] [comments]  ( 1 min )
    [Discussion] Visualization for Machine Learning as a scrub
    I'm trying to build a small portfolio for personal projects as part of my college application, and I was wondering if there was a way to visualize the data in a sense? New to ML so not sure if its even necessary besides making it more complicated. submitted by /u/Risk-Personal [link] [comments]  ( 1 min )
    [D] Two steps model for noisy data
    Suppose you have a labeled dataset with noisy data. How can you model for that? I have found this very interesting reply from this question: https://www.reddit.com/r/MachineLearning/comments/dgnpcb/d_learning_with_noisy_data_but_perfect_labels/ which says: You model for it. You can encapsulate this "junk probability" in a prior. You can even do a kind of two stage model where you have a first stage to predict whether or not the observation is total garbage to isolate it from the observations presented to your more "informed" model. It really depends on the situation. In general, the strategy is to capture this kind of noise in your variance inference. You make a prediction on an observation, but you don't report it as a point estimate: you report it with a prediction interval. Wide intervals indicate low confidence predictions, which result from noisy data. In the two stage model, you might have two different intervals: one for the prediction that the observation was useful at all, and a second interval for what your prediction is given the assumption that the observation was in fact useful. Breaking up the variance in this way should give you a (potentially deceptively) tighter conditional variance, which is what you ultimately want (tighter intervals equals more "explained variance"). I kind of understand, but I was curious if there is any source I can read more about this or some paper in which this is applied. submitted by /u/macoit18 [link] [comments]  ( 1 min )
    [D] Growing your "brain trust" - how are you networking while WFH?
    You're as good as the five closest people in your network the saying goes...and this extended WFH period has slowly diminished my network. I'd love to hear your experiences building a network of supportive DS and engineering peers. submitted by /u/Shap177 [link] [comments]
    [D] Recent Paper / Lecture Recommendations on classical ML (not DL)
    Where I work, we have a bi-weekly research meeting where we discuss papers or watch lectures together, and soon it's my turn to come up with something. So I'm looking for new exciting work that might be relevant to us. It's a fintech company, and we mostly use tabular data with classical models (Logistic, boosting trees, etc.) If anything comes to mind, please let me know Thanks! submitted by /u/shgoren [link] [comments]  ( 1 min )
    [D] Meta AI Residency interview
    Hey everyone. I have my first interview for a AI Residency position at Meta soon. The information i have recieved both mention that i should be prepared for both normal SWE interview questions (Leetcode etc) and more machine learning foucsed interviews (calculating metrics, data cleaning etc.) If anyone has been through this process recently or something similiar i would appreciate it if you could share your experience or share some tips for how i should be preparing. Thanks in advance. submitted by /u/WholeReward9042 [link] [comments]  ( 1 min )
    [D] Running your computationally heavy code on a remote server
    I have a general question regarding running computationally heavy computer vision algorithms on remote servers. So far I have been working with smaller datasets which takes only a few hours to run and yet I have faced many situations where the process broke down due to a multitude of reasons. Maybe it was my internet connection or my VSCode got disconnected? I wonder how people set their computational environment so they can run a process without trouble for days? other than checkpoints, which methods are common practice? I would be glad to hear about your opinion submitted by /u/Substantial-Ad4518 [link] [comments]  ( 1 min )
    [R] A Program to Build E(N)-Equivariant Steerable CNNs - Link to a free online lecture by the author in comments
    submitted by /u/pinter69 [link] [comments]  ( 2 min )
    [P] I made live digit recognizer with machine learning and web technologies. (Live Demo in Comments...)
    submitted by /u/dusklight00 [link] [comments]  ( 2 min )
    [R][P] Why does TfidfVectorizer from Sklearn create a document-term matrix and not a term-document matrix?
    Why does TfidfVectorizer from Sklearn create a document-term matrix and not a term-document matrix? When we use TFIDF for Latent Semantic Analysis we begin with a term-document matrix, not a document-term matrix. submitted by /u/Arup15 [link] [comments]  ( 1 min )
    [D][R] People in the industry, what image resolution do you use?
    In computer vision, the most common datasets to benchmark the new method against is CIFAR and ImageNet, which have the resolution of 32x32 and 224x244 (generally) respectively. Some methods run on larger ImageNet resolution, say 384. That resolution is small, compared to the normal image we usually take with our phones (720p, 1080p). What image resolution do you use in industry? submitted by /u/AerysSk [link] [comments]  ( 2 min )
  • Open

    100 digits worth memorizing
    I was thinking today about how people memorize many digits of π, and how it would be much more practical to memorize a moderate amount of numbers to low precision. So suppose instead of memorizing 100 digits of π, you memorized 100 digits of other numbers. What might those numbers be? I decided to take […] 100 digits worth memorizing first appeared on John D. Cook.  ( 2 min )
  • Open

    Amazing Benefits of Big Data Analytics Career
    Big data is all over the place! and it has the ability to provide us with far more than we expect. Be it:  A hospital that gathers information in order to assess specific outcomes and improve outcomes.  To enhance the appeal, a business collects data from customers.  Users’ data is collected by businesses to improve… Read More »Amazing Benefits of Big Data Analytics Career The post Amazing Benefits of Big Data Analytics Career appeared first on Data Science Central.  ( 4 min )
    Approach to AGI: how can a “synthetic being” learn?
    Three inspirational articles from Nikko Ström about Symbolic AI [1], Melanie Mitchell about AI-based Understanding [2] and Yoshua Bengio about the Conscious Prior [3], have inspired me to write my own small contribution. Since some years ago I have been thinking of how our brain learns and its relationship to human being goal of approach… Read More »Approach to AGI: how can a “synthetic being” learn? The post Approach to AGI: how can a “synthetic being” learn? appeared first on Data Science Central.  ( 2 min )
  • Open

    Will learning how to program a neural network from scratch help me understand neural networks better?
    I tried pytorch and a bit of keras, and I just can't seem to comprehend it right now for some reason submitted by /u/Hananelroe [link] [comments]  ( 1 min )
    Need help with time series classification using transformers
    Hello everyone, I am following this link to get myself familiar with transformer networks. This example demonstrates a binary classification task about engines, calling them as faulty or non-faulty. I am currently working on classifying 9 different dynamic hand gestures. I have created my own dataset with mediapipe, extracting the 3D keypoint locations of my hands for these gestures for 30 frames. I have trained LSTM's and bidirectional LSTM's with attention layers in between, and they perform arguably well. ​ I have 9 classes and have created 90 samples for each one. Each keypoint on my hand has 3 features : normalized coordinates for (x,y,z) each hand has 21 keypoints, and I have two hands, which makes a total of 126 features at any given instance. I have gathered 30 frames of data for each sample. Thus, the shape of my entire input X without train and test split is (9*90 , 30 , 3*21*2) -> (810 , 30 , 126) ​ you can see that there are 30 rows and 126 columns for my X ​ Similarly, the shape of my output y is (810 , 9) where each row consists of 1's and 0's. ​ The outputs are stored like this, because the dataset is not randomized yet, the first 90 elements of y corresponds to a single gesture and similarly, the next 90 corresponds to another one. Whiel doing the train and test split, X and y orders are randomized. ​ Here is the randomized example for y_train ​ ​ ​ The example that I mentioned above works with a y vector that has a shape of (y,1) where y is either 0 or 1 because its a binary classification problem. When I feed my y matrix to this network, I get an error about shapes being incompatible. How should I go about this? When I convert this whole y matrix to a vector by using np.argmax to extract which index is maximum (1, and thus true label), I get such a vector (0,5,7,3,4,5...) as expected. When I try to train on this, loss never drops. Any help is greatly appreciated! I can provide any further detail in case it is needed. submitted by /u/ege6211 [link] [comments]  ( 2 min )
    Can we use Softmax activation function in hidden layers ?
    submitted by /u/Actual-Performer-832 [link] [comments]  ( 1 min )

  • Open

    [D] I have come up with a new importance sampling weighting method for active learning. How do I prove it is better than uniform sampling or entropy sampling?
    In active learning for posterior density estimation, two common baselines are uniform sampling, or sampling based on the most uncertain (ranked via entropy). How do I provide some theoretical justification for a new sampling scheme? submitted by /u/PK_thundr [link] [comments]  ( 1 min )
    [D] Paper Explained - Can Wikipedia Help Offline Reinforcement Learning? (Full Video Walkthrough)
    https://youtu.be/XHGh19Hbx48 Transformers have come to overtake many domain-targeted custom models in a wide variety of fields, such as Natural Language Processing, Computer Vision, Generative Modelling, and recently also Reinforcement Learning. This paper looks at the Decision Transformer and shows that, surprisingly, pre-training the model on a language-modelling task significantly boosts its performance on Offline Reinforcement Learning. The resulting model achieves higher scores, can get away with less parameters, and exhibits superior scaling properties. This raises many questions about the fundamental connection between the domains of language and RL. ​ OUTLINE: 0:00 - Intro 1:35 - Paper Overview 7:35 - Offline Reinforcement Learning as Sequence Modelling 12:00 - Input Embedding Alignment & other additions 16:50 - Main experimental results 20:45 - Analysis of the attention patterns across models 32:25 - More experimental results (scaling properties, ablations, etc.) 37:30 - Final thoughts ​ Paper: https://arxiv.org/abs/2201.12122 Code: https://github.com/machelreid/can-wikipedia-help-offline-rl My Video on Decision Transformer: https://youtu.be/-buULmf7dec submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [D] Quantum Mechanics in Machine Learning
    I've been looking up ideas and models in ML, born out of inspirations from Quantum Mechanics phenomenons. I'm specifically talking about ideas like Bolzmann machines which are based on another physical phenomenon, the Ising model. I must be using the wrong terms and keywords because all my searches bring up research related to Quantum Neural Networks. I was wondering if anyone here is pre-acquiated with such works and would be kind enough to share them. submitted by /u/mfarahmand98 [link] [comments]  ( 1 min )
    [N] MIT Media lab founder describing the importance of personal data
    https://www.youtube.com/watch?v=sKhCKWR8OwI&list=PLhCQIYxdniNtEto-CO33yVb0ah7yKOVV5&index=5 submitted by /u/Corddot [link] [comments]
    [R][P] Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework + VQA Hugging Face Spaces Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Different ways to consolidate ML concepts/techniques/research/
    Aside from practical learning through hands on building ML models, etc., what memory techniques and note-taking methods do you go about retaining the concepts/techniques/research in ML and AI and the devices that you use to help consolidate into long term memory? I'm just seeing if there are other ways of retention specific to ML other than the traditional learning methodologies like spaced repetition, active recall, etc. that have helped you understand it and articulate it more eloquently in conversation. submitted by /u/cloudlessjedi [link] [comments]  ( 1 min )
    [D] Anomaly detection using only benign examples
    Dear fellow redditors. I recently started developing an anomaly detection system using only benign examples (think benign credit card transactions) and don't want to use known fraudulent transactions in order to be able to detect zero-day fraudulent activity for which we don't have data examples yet. In modern literature I've seen these approaches being used : - Local Outlier Factor - Autoencoders - Bayesian Networks - Random Cut Forest - Elliptic Envelope - Fuzzy Logic - Genetic Algorithms What are your thoughts and personal experience with zero-day anomaly detection ? And which techniques would you recommend, including some which may not be listed here. Thank you. submitted by /u/Meow7788 [link] [comments]  ( 1 min )
    [D] Using Gumbel Softmax for Discrete Action and Label selection. Is it the best option?
    If I want to train a classifier that selects from a set of let's say 5-labels and depending on a particular label (if-condition) there's another representation/embedding that is trained to predict another label and I want to train this whole model in an end to end setting; I would require something like Gumbel Softmax to sample the label for the first 5-label classifier right? Is this the only method? Actually, the problem is that there are 5 labels, and 1 of the labels has let's say 5 more sub types! So, I think I can just directly train using normal softmax? and based on the prediction like with teacher-forcing I can see if the sub-type label is predicted and if it is I just pass the input to the second classifier! It should work properly as well.. and still be differentiable! But, could gumbel softmax be used in a setting like this? and help? If someone has information on this, can you please point me towards other models/approaches/techniques for selection and yet having an end-to-end model? submitted by /u/goobuddy [link] [comments]  ( 2 min )
    [P] How to predict price.
    Hello, i want to use ml code to predict the patrolprice at my local patrol station. I have a dataset with this columns: variety | price | date | time I want to predict the price for example the next week, or next day. Is this possible to solve with h2o automl or how you would solve it? Do you have a tutorial for me? submitted by /u/Icy_Collection_1951 [link] [comments]  ( 1 min )
    [Announcement] AMA Friday morning EST (4th March) with Sebastian Raschka author of Machine Learning with Pytorch and SciKit-learn
    AMA friday morning EST with Sebastian Raschka /u/seraschka author of Machine Learning with Pytorch and Scikit-Learn Book It is on goodreads here https://www.goodreads.com/book/show/60098440-machine-learning-with-pytorch-and-scikit-learn?from_search=true&from_srp=true&qid=MNIHuvctFr&rank=1 And a link to the code and such here https://www.reddit.com/r/learnmachinelearning/comments/t1gqqe/machine_learning_with_pytorch_and_scikitlearn_book/ ​ Ask him questions about his new book, academic research, or his job at http://Grid.ai submitted by /u/cavedave [link] [comments]  ( 1 min )
    [R] Cloning a musical instrument from 16 seconds of audio (WIP)
    submitted by /u/More_Return_1166 [link] [comments]  ( 1 min )
    [R] What are some Text Similarity methods?
    So, first post here. I have been working with ML and Data Science for a year now. I used to work with classical Machine Learning majorly (Trees and Linear Regression). I recently shifted to a new job with focus on NLP and I've been given a task to optimise our product. I can't go and explain everything but here is brief and where I am stuck. The task is we're trying to have a field in our portal in which when someone posts a word(skill like Python, Management etc) we should get reccomended with a similar skill from our Data which we have. So, basically Text Similarity. Where, I am stuck is. I have looked at multiple papers and they majorly focus on sentence whereas I only have singular words. Two-Three methods I'm trying are 1. sense2vec 2. word2vec (which is not much useful) 3. Tensorflow Universal Sentence Encoder Would appreciate someone who can help me lead the way. submitted by /u/jeevanshud [link] [comments]  ( 2 min )
    [D] Lecture on Concise Slides
    I have been asked to make a short video series of the concise slides I am making. I would really love some guidance on this before I spend a lot of time making videos. Can you please let me know your opinions on this poll? If you haven't seen them yet, please check them out here: https://www.reddit.com/r/MachineLearningDervs/ Thank you for helping me :) View Poll submitted by /u/Ok_Can2425 [link] [comments]  ( 1 min )
    [R] I need help in identifying which requirements are needed to create human-centred AI?
    Hi everyone, I am currently studying my PhD in software engineering and my topic is on identifying Requirements Engineering techniques for AI systems. I am in the process of conducting a survey to investigate how current practices manage requirements when building AI systems. If you have worked as a Data Scientist, Machine Learning Specialist or a Software Engineer working with an AI component, I would really appreciate your assistance in completing my survey. We have received Deakin University’s Ethics approval (reference number: SEBE-2021-41- MOD01). This is an anonymous survey and would take approximately 10 to 15 minuets of your time. If you are interested in participating, please fill out our questionnaire at: https://researchsurveys.deakin.edu.au/jfe/form/SV_dd6vdkrNMUr8pfg submitted by /u/Positive-Object-7296 [link] [comments]  ( 1 min )
    [D] Model Extraction Attack
    Does anyone have experience with data-free model extraction (training a duplicate model without using the original data), especially for video models? submitted by /u/One-Yogurt7320 [link] [comments]
    [D] Self-serve for users
    How many of your models are actually deployed in automated environments, and what proportion is used locally that you then send out as a report/csv/dashboard etc.? I see a lot of conversations here that talk about production pipelines and deployments but I work with teams that do tons of analysis, especially for smaller projects, that are done locally. DS teams end up working on business requests that involve downloading files, building some analysis and then sending results out as csvs or dashboards. Is that true for you as well? Are there tools you use to publish non production models to business users (that can't code) to use on a self-serve basis? submitted by /u/dmart89 [link] [comments]  ( 1 min )
    [D] Paper Explained – How Do Vision Transformers Work?
    https://youtu.be/dOwRXpSSc8E In this video, we present the lessons of the recent "How do Vision Transformers work?" paper: https://arxiv.org/abs/2202.06709v1 Multi-head self-attention (MSA) and convolutions are complementary, or so it turns out. So what are the differences between MSA and Convs? Outline: 00:00 Transformers vs ConvNets 01:04 Sponsor: Weights & Biases 02:21 Convolutions explained in a nutshell 03:35 Multi-Head Self-Attention explained 06:46 Why we thought that MSA is cool 09:56 Paper insights 15:26 MSA vs. Convs (more insight) 16:07 Low-pass filters (MSA) and high-pass filters (Convs) submitted by /u/AICoffeeBreak [link] [comments]  ( 1 min )
    Machine learning in games [P]
    Hi. One month ago, I created my first game(for my senior project) in Lua using the love2d framework. This game is an icy tower clone. I planned to make use of AI so that eventually a software agent will play the game for me. AI is a very big topic and I know only the basics. I am in my last semester as a CS student and I do not have much time in hand. Any tips concerning how can I achieve what I want in a reasonable amount of time? submitted by /u/Joe2001r [link] [comments]  ( 1 min )
    [D] Anyone able to reproduce cpn detection results from PoseFormer or VideoPose3D?
    See links attached for the two specifically the cpn results: Poseformer - https://github.com/zczcwh/PoseFormer VideoPose - https://github.com/facebookresearch/VideoPose3D submitted by /u/AbjectDrink3276 [link] [comments]
    [D] what percent of top conference papers fudge results?
    I’ve noticed recently a lot of top conference papers with available repos are not reproducible. What have been everyone’s experience with this and are there any studies estimating what percent of papers fudge results? submitted by /u/AbjectDrink3276 [link] [comments]  ( 3 min )
  • Open

    From "All models are wrong, but some are useful" to "All models are biased, but some are usefully biased"
    I guess most AI practitioners are familiar with the quote: "All models are wrong, but some are useful" George E. P. Box Which is used to convey the idea that modeling reality is virtually impossible, and that one should try instead to model a bounded problem with a clear purpose, even if it fails to represent reality faithfully. To inspire my students, I came up with the following quote: "All models are biased, but some are usefully biased" In the context of XAI, my goal is to convey the idea that machine learning is all about learning biases, and that is only our human perception which eventually decides which biases are appropriate and acceptable practice, and which are wrong and should be avoided. IMO this is a key lesson for anyone working in ML bias detection, analysis and suppression. So, does the quote help in any way? Or is it too obvious and/or arrogant? ​ View Poll submitted by /u/Phastolf [link] [comments]  ( 1 min )
    Hey Friends, Here is Cool Research from Google AI Where They Proposed a Deep Learning–based Algorithm to Improve Medical Ventilator Control for Invasive Ventilation
    Mechanical ventilators assist patients in breathing undergoing surgery or who cannot breathe independently due to acute disease. The patients are A hollow tube (artificial airway) is inserted into the patient’s mouth and down into their primary airway, or trachea, to connect them to the ventilator. The ventilator follows a clinician-prescribed breathing waveform based on a respiratory measurement from the patient(e.g., airway pressure, tidal volume). Therefore, it’s a challenging task and needs systems that are both robust to variances or changes in the lungs of patients and adherent to the ideal waveform to avoid injury. That is one of the reasons why they are monitored closely by highly experienced professionals ensuring that their performance meets the needs of the patients and does not cause lung damage. A recent Google study proposes a neural network-based controller that improves medical ventilator control by balancing these properties, reducing the risk for injury to patients’ lungs. The new control algorithm detects airway pressure and computes necessary airflow modifications to meet prescribed values using signals from an artificial lung. This method requires less manual intervention from clinicians and shows great robustness and performance compared to other systems while requiring. You can read my full summary of this paper here If you want to read the paper, you can check it here Please let me know in the comments what do you think about this research. https://preview.redd.it/61vqjqpza8k81.png?width=1410&format=png&auto=webp&s=3f4eb60fd0ab34bb934d1b36281aac5008fed1e3 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    How far are we off the AI singularity? Where are we now what happens next and whats the process that "will" lead to the singularity?
    How far are we off the AI singularity? Where are we now what happens next and whats the process that "will" lead to the singularity? submitted by /u/Fun_Sun_97 [link] [comments]  ( 1 min )
    AI researchers: how do you recruit participants for your research?
    I'm working on game-playing AI as part of my 4th year dissertation, and am now required to test said AI through user evaluation. The problem is that I haven't thought about how to recruit participants for my study. It'd be really interesting to know how people who have experience with this go about recruiting participants. submitted by /u/HaresM [link] [comments]  ( 2 min )
    How can A.I. contribute to the Environment?
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Grammar, Pronunciation & Background Noise Correction with Perceiver IO
    submitted by /u/OnlyProggingForFun [link] [comments]
    Using A.I. to Determine the True Color of the Environment of Mars (70 A.I. Colored Photos) | The Newest Raw Photographs of Mars Will be Produced and Uploaded to a New Article Every Few Days.
    submitted by /u/Alehti [link] [comments]  ( 1 min )
    AI art generation challenge
    I created a webpage "challenge" for creating AI images with a set of base prompt: https://ai.botbox.dev/ai-image-generation-challenge/ submitted by /u/Recent_Coffee_2551 [link] [comments]
    AI Tutorials
    I recently made this website to show how to run various AIs locally (currently only one), does anyone have any suggestions on how I could improve it? https://ai.botbox.dev/ submitted by /u/Recent_Coffee_2551 [link] [comments]
  • Open

    approximation of Optimal Action equation for continuous time domain
    In section 3.4 of QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds , how to derive equation 40 from equation 39 ? https://preview.redd.it/1w8qjqv4f6k81.png?width=802&format=png&auto=webp&s=d5f1736e64cbeec9db584d876a7db83bc8be4a66 submitted by /u/promach [link] [comments]  ( 1 min )
    Updating actor/critic after each episode
    Apologies if this has been asked before but I couldn't find anything. Is there a major issue with updating the weights of the actor/critic after the completion of an episode, as opposed to after every time step? I get that you're using a slightly 'out of date' value/advantage function for your actor policy updates, but is this actually a big problem? I ask as obviously deferring updating until the end of an episode runs significantly faster. My agent for playing Pacman using this method currently does not learn very well but I figure this may just be down to sensitivity to the hyperparameters. submitted by /u/claybecray [link] [comments]  ( 1 min )
    Why doesn't RL from pixel use pre-trained backbone?
    So I have read about papers like DrQ, RAD, etc., that aim to improve sample efficiency for RL from pixels by using data augmentation. My question is like the title, why don't people use a pre-trained backbone as the feature extractor, like VGG, ResNet, etc.? Intuitively, this should also improve sample efficiency. I haven't tried to do this, and a quick search did not lead me to any resource about this, so please let me know if it works for RL or not, and why so. submitted by /u/dx_rd_to_DX [link] [comments]  ( 2 min )
  • Open

    Interpretable Unsupervised Diversity Denoising and Artefact Removal. (arXiv:2104.01374v2 [eess.IV] UPDATED)
    Image denoising and artefact removal are complex inverse problems admitting multiple valid solutions. Unsupervised diversity restoration, that is, obtaining a diverse set of possible restorations given a corrupted image, is important for ambiguity removal in many applications such as microscopy where paired data for supervised training are often unobtainable. In real world applications, imaging noise and artefacts are typically hard to model, leading to unsatisfactory performance of existing unsupervised approaches. This work presents an interpretable approach for unsupervised and diverse image restoration. To this end, we introduce a capable architecture called Hierarchical DivNoising (HDN) based on hierarchical Variational Autoencoder. We show that HDN learns an interpretable multi-scale representation of artefacts and we leverage this interpretability to remove imaging artefacts commonly occurring in microscopy data. Our method achieves state-of-the-art results on twelve benchmark image denoising datasets while providing access to a whole distribution of sensibly restored solutions. Additionally, we demonstrate on three real microscopy datasets that HDN removes artefacts without supervision, being the first method capable of doing so while generating multiple plausible restorations all consistent with the given corrupted image.  ( 2 min )
    Physics perception in sloshing scenes with guaranteed thermodynamic consistency. (arXiv:2106.13301v3 [cs.CV] UPDATED)
    Physics perception very often faces the problem that only limited data or partial measurements on the scene are available. In this work, we propose a strategy to learn the full state of sloshing liquids from measurements of the free surface. Our approach is based on recurrent neural networks (RNN) that project the limited information available to a reduced-order manifold so as to not only reconstruct the unknown information, but also to be capable of performing fluid reasoning about future scenarios in real time. To obtain physically consistent predictions, we train deep neural networks on the reduced-order manifold that, through the employ of inductive biases, ensure the fulfillment of the principles of thermodynamics. RNNs learn from history the required hidden information to correlate the limited information with the latent space where the simulation occurs. Finally, a decoder returns data back to the high-dimensional manifold, so as to provide the user with insightful information in the form of augmented reality. This algorithm is connected to a computer vision system to test the performance of the proposed methodology with real information, resulting in a system capable of understanding and predicting future states of the observed fluid in real-time.  ( 2 min )
    Learning from distinctive candidates to optimize reduced-precision convolution program on tensor cores. (arXiv:2202.06819v2 [cs.LG] UPDATED)
    Convolution is one of the fundamental operations of deep neural networks with demanding matrix computation. In a graphic processing unit (GPU), Tensor Core is a specialized matrix processing hardware equipped with reduced-precision matrix-multiply-accumulate (MMA) instructions to increase throughput. However, it is challenging to achieve optimal performance since the best scheduling of MMA instructions varies for different convolution sizes. In particular, reduced-precision MMA requires many elements grouped as a matrix operand, seriously limiting data reuse and imposing packing and layout overhead on the schedule. This work proposes an automatic scheduling method of reduced-precision MMA for convolution operation. In this method, we devise a search space that explores the thread tile and warp sizes to increase the data reuse despite a large matrix operand of reduced-precision MMA. The search space also includes options of register-level packing and layout optimization to lesson overhead of handling reduced-precision data. Finally, we propose a search algorithm to find the best schedule by learning from the distinctive candidates. This reduced-precision MMA optimization method is evaluated on convolution operations of popular neural networks to demonstrate substantial speedup on Tensor Core compared to the state of the arts with shortened search time.  ( 2 min )
    Score-based diffusion models for accelerated MRI. (arXiv:2110.05243v2 [eess.IV] UPDATED)
    Score-based diffusion models provide a powerful way to model images using the gradient of the data distribution. Leveraging the learned score function as a prior, here we introduce a way to sample data from a conditional distribution given the measurements, such that the model can be readily used for solving inverse problems in imaging, especially for accelerated MRI. In short, we train a continuous time-dependent score function with denoising score matching. Then, at the inference stage, we iterate between numerical SDE solver and data consistency projection step to achieve reconstruction. Our model requires magnitude images only for training, and yet is able to reconstruct complex-valued data, and even extends to parallel imaging. The proposed method is agnostic to sub-sampling patterns, and can be used with any sampling schemes. Also, due to its generative nature, our approach can quantify uncertainty, which is not possible with standard regression settings. On top of all the advantages, our method also has very strong performance, even beating the models trained with full supervision. With extensive experiments, we verify the superiority of our method in terms of quality and practicality.  ( 2 min )
    Enterprise-Scale Search: Accelerating Inference for Sparse Extreme Multi-Label Ranking Trees. (arXiv:2106.02697v3 [cs.LG] UPDATED)
    Tree-based models underpin many modern semantic search engines and recommender systems due to their sub-linear inference times. In industrial applications, these models operate at extreme scales, where every bit of performance is critical. Memory constraints at extreme scales also require that models be sparse, hence tree-based models are often back-ended by sparse matrix algebra routines. However, there are currently no sparse matrix techniques specifically designed for the sparsity structure one encounters in tree-based models for extreme multi-label ranking/classification (XMR/XMC) problems. To address this issue, we present the masked sparse chunk multiplication (MSCM) technique, a sparse matrix technique specifically tailored to XMR trees. MSCM is easy to implement, embarrassingly parallelizable, and offers a significant performance boost to any existing tree inference pipeline at no cost. We perform a comprehensive study of MSCM applied to several different sparse inference schemes and benchmark our methods on a general purpose extreme multi-label ranking framework. We observe that MSCM gives consistently dramatic speedups across both the online and batch inference settings, single- and multi-threaded settings, and on many different tree models and datasets. To demonstrate its utility in industrial applications, we apply MSCM to an enterprise-scale semantic product search problem with 100 million products and achieve sub-millisecond latency of 0.88 ms per query on a single thread -- an 8x reduction in latency over vanilla inference techniques. The MSCM technique requires absolutely no sacrifices to model accuracy as it gives exactly the same results as standard sparse matrix techniques. Therefore, we believe that MSCM will enable users of XMR trees to save a substantial amount of compute resources in their inference pipelines at very little cost.  ( 2 min )
    Formalizing the Generalization-Forgetting Trade-off in Continual Learning. (arXiv:2109.14035v3 [cs.LG] UPDATED)
    We formulate the continual learning (CL) problem via dynamic programming and model the trade-off between catastrophic forgetting and generalization as a two-player sequential game. In this approach, player 1 maximizes the cost due to lack of generalization whereas player 2 minimizes the cost due to catastrophic forgetting. We show theoretically that a balance point between the two players exists for each task and that this point is stable (once the balance is achieved, the two players stay at the balance point). Next, we introduce balanced continual learning (BCL), which is designed to attain balance between generalization and forgetting and empirically demonstrate that BCL is comparable to or better than the state of the art.  ( 2 min )
    HODA: Hardness-Oriented Detection of Model Extraction Attacks. (arXiv:2106.11424v2 [cs.LG] UPDATED)
    Model Extraction attacks exploit the target model's prediction API to create a surrogate model in order to steal or reconnoiter the functionality of the target model in the black-box setting. Several recent studies have shown that a data-limited adversary who has no or limited access to the samples from the target model's training data distribution can use synthesis or semantically similar samples to conduct model extraction attacks. In this paper, we define the hardness degree of a sample using the concept of learning difficulty. The hardness degree of a sample depends on the epoch number that the predicted label of that sample converges. We investigate the hardness degree of samples and demonstrate that the hardness degree histogram of a data-limited adversary's sample sequences is distinguishable from the hardness degree histogram of benign users' samples sequences. We propose Hardness-Oriented Detection Approach (HODA) to detect the sample sequences of model extraction attacks. The results demonstrate that HODA can detect the sample sequences of model extraction attacks with a high success rate by only monitoring 100 samples of them, and it outperforms all previous model extraction detection methods.  ( 2 min )
    Object-aware Monocular Depth Prediction with Instance Convolutions. (arXiv:2112.01521v2 [cs.CV] UPDATED)
    With the advent of deep learning, estimating depth from a single RGB image has recently received a lot of attention, being capable of empowering many different applications ranging from path planning for robotics to computational cinematography. Nevertheless, while the depth maps are in their entirety fairly reliable, the estimates around object discontinuities are still far from satisfactory. This can be contributed to the fact that the convolutional operator naturally aggregates features across object discontinuities, resulting in smooth transitions rather than clear boundaries. Therefore, in order to circumvent this issue, we propose a novel convolutional operator which is explicitly tailored to avoid feature aggregation of different object parts. In particular, our method is based on estimating per-part depth values by means of superpixels. The proposed convolutional operator, which we dub "Instance Convolution", then only considers each object part individually on the basis of the estimated superpixels. Our evaluation with respect to the NYUv2 as well as the iBims dataset clearly demonstrates the superiority of Instance Convolutions over the classical convolution at estimating depth around occlusion boundaries, while producing comparable results elsewhere. Code will be made publicly available upon acceptance.  ( 2 min )
    Time Series Generation with Masked Autoencoder. (arXiv:2201.07006v2 [cs.LG] UPDATED)
    This paper shows that masked autoencoder with interpolator (InterpoMAE) is a scalable self-supervised generative model for time series. InterpoMAE masks random patches of the input time series and recover the missing patches in latent space. The core design is that no mask token is used. InterpoMAE disentangles missing patch recovery from the decoder. An interpolator directly recovers the missing patches without mask tokens. This design helps InterpoMAE to consistently and significantly outperforms state-of-the-art (SoTA) benchmarks in time series generation. Our approach also shows promising scaling behaviour in various downstream tasks such as time series classification, prediction and imputation. As the only self-supervised generative model for time series, InterpoMAE is the first in literature that allows explicit management on the synthetic data. Time series generation may follow the trajectory of self-supervised learning now.  ( 2 min )
    A Survey on Interpretable Reinforcement Learning. (arXiv:2112.13112v2 [cs.LG] UPDATED)
    Although deep reinforcement learning has become a promising machine learning approach for sequential decision-making problems, it is still not mature enough for high-stake domains such as autonomous driving or medical applications. In such contexts, a learned policy needs for instance to be interpretable, so that it can be inspected before any deployment (e.g., for safety and verifiability reasons). This survey provides an overview of various approaches to achieve higher interpretability in reinforcement learning (RL). To that aim, we distinguish interpretability (as a property of a model) and explainability (as a post-hoc operation, with the intervention of a proxy) and discuss them in the context of RL with an emphasis on the former notion. In particular, we argue that interpretable RL may embrace different facets: interpretable inputs, interpretable (transition/reward) models, and interpretable decision-making. Based on this scheme, we summarize and analyze recent work related to interpretable RL with an emphasis on papers published in the past 10 years. We also discuss briefly some related research areas and point to some potential promising research directions.  ( 2 min )
    Fast Non-Bayesian Poisson Factorization for Implicit-Feedback Recommendations. (arXiv:1811.01908v5 [cs.LG] UPDATED)
    This work explores non-negative low-rank matrix factorization based on regularized Poisson models (PF or "Poisson factorization" for short) for recommender systems with implicit-feedback data. The properties of Poisson likelihood allow a shortcut for very fast computations over zero-valued inputs, and oftentimes results in very sparse factors for both users and items. Compared to HPF (a popular Bayesian formulation of the problem with hierarchical priors), the frequentist optimization-based approach presented here tends to produce better top-N recommendations with significantly shorter fitting times, on top of having sparse solutions.  ( 2 min )
    Probing Predictions on OOD Images via Nearest Categories. (arXiv:2011.08485v4 [cs.LG] UPDATED)
    We study out-of-distribution (OOD) prediction behavior of neural networks when they classify images from unseen classes or corrupted images. To probe the OOD behavior, we introduce a new measure, nearest category generalization (NCG), where we compute the fraction of OOD inputs that are classified with the same label as their nearest neighbor in the training set. Our motivation stems from understanding the prediction patterns of adversarially robust networks, since previous work has identified unexpected consequences of training to be robust to norm-bounded perturbations. We find that robust networks have consistently higher NCG accuracy than natural training, even when the OOD data is much farther away than the robustness radius. This implies that the local regularization of robust training has a significant impact on the network's decision regions. We replicate our findings using many datasets, comparing new and existing training methods. Overall, adversarially robust networks resemble a nearest neighbor classifier when it comes to OOD data. Code available at https://github.com/yangarbiter/nearest-category-generalization.  ( 2 min )
    Solving optimization problems with Blackwell approachability. (arXiv:2202.12277v1 [math.OC])
    We introduce the Conic Blackwell Algorithm$^+$ (CBA$^+$) regret minimizer, a new parameter- and scale-free regret minimizer for general convex sets. CBA$^+$ is based on Blackwell approachability and attains $O(\sqrt{T})$ regret. We show how to efficiently instantiate CBA$^+$ for many decision sets of interest, including the simplex, $\ell_{p}$ norm balls, and ellipsoidal confidence regions in the simplex. Based on CBA$^+$, we introduce SP-CBA$^+$, a new parameter-free algorithm for solving convex-concave saddle-point problems, which achieves a $O(1/\sqrt{T})$ ergodic rate of convergence. In our simulations, we demonstrate the wide applicability of SP-CBA$^+$ on several standard saddle-point problems, including matrix games, extensive-form games, distributionally robust logistic regression, and Markov decision processes. In each setting, SP-CBA$^+$ achieves state-of-the-art numerical performance, and outperforms classical methods, without the need for any choice of step sizes or other algorithmic parameters.  ( 2 min )
    Symbolic Brittleness in Sequence Models: on Systematic Generalization in Symbolic Mathematics. (arXiv:2109.13986v2 [cs.LG] UPDATED)
    Neural sequence models trained with maximum likelihood estimation have led to breakthroughs in many tasks, where success is defined by the gap between training and test performance. However, their ability to achieve stronger forms of generalization remains unclear. We consider the problem of symbolic mathematical integration, as it requires generalizing systematically beyond the test set. We develop a methodology for evaluating generalization that takes advantage of the problem domain's structure and access to a verifier. Despite promising in-distribution performance of sequence-to-sequence models in this domain, we demonstrate challenges in achieving robustness, compositionality, and out-of-distribution generalization, through both carefully constructed manual test suites and a genetic algorithm that automatically finds large collections of failures in a controllable manner. Our investigation highlights the difficulty of generalizing well with the predominant modeling and learning approach, and the importance of evaluating beyond the test set, across different aspects of generalization.  ( 2 min )
    Real-world challenges for multi-agent reinforcement learning in grid-interactive buildings. (arXiv:2112.06127v2 [cs.LG] UPDATED)
    Building upon prior research that highlighted the need for standardizing environments for building control research, and inspired by recently introduced challenges for real life reinforcement learning control, here we propose a non-exhaustive set of nine real world challenges for reinforcement learning control in grid-interactive buildings. We argue that research in this area should be expressed in this framework in addition to providing a standardized environment for repeatability. Advanced controllers such as model predictive control and reinforcement learning (RL) control have both advantages and disadvantages that prevent them from being implemented in real world problems. Comparisons between the two are rare, and often biased. By focusing on the challenges, we can investigate the performance of the controllers under a variety of situations and generate a fair comparison. As a demonstration, we implement the offline learning challenge in CityLearn and study the impact of different levels of domain knowledge and complexity of RL algorithms. We show that the sequence of operations utilized in a rule based controller (RBC) used for offline training affects the performance of the RL agents when evaluated on a set of four energy flexibility metrics. Longer offline learning from an optimized RBC leads to improved performance in the long run. RL agents that learn from a simplified RBC risk poorer performance as the offline learning period increases. We also observe no impact on performance from information sharing amongst agents. We call for a more interdisciplinary effort of the research community to address the real world challenges, and unlock the potential of grid-interactive building  ( 2 min )
    Intelligent Zero Trust Architecture for 5G/6G Networks: Principles, Challenges, and the Role of Machine Learning in the context of O-RAN. (arXiv:2105.01478v2 [cs.NI] UPDATED)
    In this position paper, we discuss the critical need for integrating zero trust (ZT) principles into next-generation communication networks (5G/6G). We highlight the challenges and introduce the concept of an intelligent zero trust architecture (i-ZTA) as a security framework in 5G/6G networks with untrusted components. While network virtualization, software-defined networking (SDN), and service-based architectures (SBA) are key enablers of 5G networks, operating in an untrusted environment has also become a key feature of the networks. Further, seamless connectivity to a high volume of devices has broadened the attack surface on information infrastructure. Network assurance in a dynamic untrusted environment calls for revolutionary architectures beyond existing static security frameworks. To the best of our knowledge, this is the first position paper that presents the architectural concept design of an i-ZTA upon which modern artificial intelligence (AI) algorithms can be developed to provide information security in untrusted networks. We introduce key ZT principles as real-time Monitoring of the security state of network assets, Evaluating the risk of individual access requests, and Deciding on access authorization using a dynamic trust algorithm, called MED components. To ensure ease of integration, the envisioned architecture adopts an SBA-based design, similar to the 3GPP specification of 5G networks, by leveraging the open radio access network (O-RAN) architecture with appropriate real-time engines and network interfaces for collecting necessary machine learning data. Therefore, this work provides novel research directions to design machine learning based components that contribute towards i-ZTA for the future 5G/6G networks.  ( 2 min )
    Exact Community Recovery over Signed Graphs. (arXiv:2202.12255v1 [cs.SI])
    Signed graphs encode similarity and dissimilarity relationships among different entities with positive and negative edges. In this paper, we study the problem of community recovery over signed graphs generated by the signed stochastic block model (SSBM) with two equal-sized communities. Our approach is based on the maximum likelihood estimation (MLE) of the SSBM. Unlike many existing approaches, our formulation reveals that the positive and negative edges of a signed graph should be treated unequally. We then propose a simple two-stage iterative algorithm for solving the regularized MLE. It is shown that in the logarithmic degree regime, the proposed algorithm can exactly recover the underlying communities in nearly-linear time at the information-theoretic limit. Numerical results on both synthetic and real data are reported to validate and complement our theoretical developments and demonstrate the efficacy of the proposed method.  ( 2 min )
    Inflation of test accuracy due to data leakage in deep learning-based classification of OCT images. (arXiv:2202.12267v1 [eess.IV])
    In the application of deep learning on optical coherence tomography (OCT) data, it is common to train classification networks using 2D images originating from volumetric data. Given the micrometer resolution of OCT systems, consecutive images are often very similar in both visible structures and noise. Thus, an inappropriate data split can result in overlap between the training and testing sets, with a large portion of the literature overlooking this aspect. In this study, the effect of improper dataset splitting on model evaluation is demonstrated for two classification tasks using two OCT open-access datasets extensively used in the literature, Kermany's ophthalmology dataset and AIIMS breast tissue dataset. Our results show that the classification accuracy is inflated by 3.9 to 26 percentage units for models tested on a dataset with improper splitting, highlighting the considerable effect of dataset handling on model evaluation. This study intends to raise awareness on the importance of dataset splitting for research on deep learning using OCT data and volumetric data in general.  ( 2 min )
    Online Graph Learning in Dynamic Environments. (arXiv:2110.05023v2 [cs.LG] UPDATED)
    Inferring the underlying graph topology that characterizes structured data is pivotal to many graph-based models when pre-defined graphs are not available. This paper focuses on learning graphs in the case of sequential data in dynamic environments. For sequential data, we develop an online version of classic batch graph learning method. To better track graphs in dynamic environments, we assume graphs evolve in certain patterns such that dynamic priors might be embedded in the online graph learning framework. When the information of these hidden patterns is not available, we use history data to predict the evolution of graphs. Furthermore, dynamic regret analysis of the proposed method is performed and illustrates that our online graph learning algorithms can reach sublinear dynamic regret. Experimental results support the fact that our method is superior to the state-of-art methods.  ( 2 min )
    MLDemon: Deployment Monitoring for Machine Learning Systems. (arXiv:2104.13621v5 [cs.LG] UPDATED)
    Post-deployment monitoring of ML systems is critical for ensuring reliability, especially as new user inputs can differ from the training distribution. Here we propose a novel approach, MLDemon, for ML DEployment MONitoring. MLDemon integrates both unlabeled data and a small amount of on-demand labels to produce a real-time estimate of the ML model's current performance on a given data stream. Subject to budget constraints, MLDemon decides when to acquire additional, potentially costly, expert supervised labels to verify the model. On temporal datasets with diverse distribution drifts and models, MLDemon outperforms existing approaches. Moreover, we provide theoretical analysis to show that MLDemon is minimax rate optimal for a broad class of distribution drifts.  ( 2 min )
    Separation of Scales and a Thermodynamic Description of Feature Learning in Some CNNs. (arXiv:2112.15383v2 [stat.ML] UPDATED)
    Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Their scale and complexity, often involving billions of inter-dependent internal degrees of freedom, renders direct microscopic analysis difficult. Under such circumstances, a common strategy is to identify slow degrees of freedom that average out the erratic behavior of the underlying fast microscopic variables. Here, we identify such a separation of scales occurring in fully trained over-parameterized deep convolutional neural networks (CNNs). Specifically, we show that DNN layers couple only through the second moment (kernels) of their activations and pre-activations. Moreover, in various settings, the latter fluctuate in a nearly Gaussian manner. For CNNs with infinitely many channels, these kernels are inert, while for finite CNNs they adapt to the data. In several deep non-linear CNN models trained on real data, the resulting thermodynamic theory of deep learning yields accurate predictions. In addition, it provides new ways of analyzing and understanding CNNs, and DNNs in general.  ( 2 min )
    Adaptive Multi-Goal Exploration. (arXiv:2111.12045v2 [cs.LG] UPDATED)
    We introduce a generic strategy for provably efficient multi-goal exploration. It relies on AdaGoal, a novel goal selection scheme that leverages a measure of uncertainty in reaching states to adaptively target goals that are neither too difficult nor too easy. We show how AdaGoal can be used to tackle the objective of learning an $\epsilon$-optimal goal-conditioned policy for the (initially unknown) set of goal states that are reachable within $L$ steps in expectation from a reference state $s_0$ in a reward-free Markov decision process. In the tabular case with $S$ states and $A$ actions, our algorithm requires $\tilde{O}(L^3 S A \epsilon^{-2})$ exploration steps, which is nearly minimax optimal. We also readily instantiate AdaGoal in linear mixture Markov decision processes, yielding the first goal-oriented PAC guarantee with linear function approximation. Beyond its strong theoretical guarantees, we anchor AdaGoal in goal-conditioned deep reinforcement learning, both conceptually and empirically, by connecting its idea of selecting "uncertain" goals to maximizing value ensemble disagreement.  ( 2 min )
    INTERN: A New Learning Paradigm Towards General Vision. (arXiv:2111.08687v2 [cs.CV] UPDATED)
    Enormous waves of technological innovations over the past several years, marked by the advances in AI technologies, are profoundly reshaping the industry and the society. However, down the road, a key challenge awaits us, that is, our capability of meeting rapidly-growing scenario-specific demands is severely limited by the cost of acquiring a commensurate amount of training data. This difficult situation is in essence due to limitations of the mainstream learning paradigm: we need to train a new model for each new scenario, based on a large quantity of well-annotated data and commonly from scratch. In tackling this fundamental problem, we move beyond and develop a new learning paradigm named INTERN. By learning with supervisory signals from multiple sources in multiple stages, the model being trained will develop strong generalizability. We evaluate our model on 26 well-known datasets that cover four categories of tasks in computer vision. In most cases, our models, adapted with only 10% of the training data in the target domain, outperform the counterparts trained with the full set of data, often by a significant margin. This is an important step towards a promising prospect where such a model with general vision capability can dramatically reduce our reliance on data, thus expediting the adoption of AI technologies. Furthermore, revolving around our new paradigm, we also introduce a new data system, a new architecture, and a new benchmark, which, together, form a general vision ecosystem to support its future development in an open and inclusive manner. See project website at https://opengvlab.shlab.org.cn .  ( 3 min )
    Systematic review of deep learning and machine learning for building energy. (arXiv:2202.12269v1 [cs.LG])
    The building energy (BE) management has an essential role in urban sustainability and smart cities. Recently, the novel data science and data-driven technologies have shown significant progress in analyzing the energy consumption and energy demand data sets for a smarter energy management. The machine learning (ML) and deep learning (DL) methods and applications, in particular, have been promising for the advancement of the accurate and high-performance energy models. The present study provides a comprehensive review of ML and DL-based techniques applied for handling BE systems, and it further evaluates the performance of these techniques. Through a systematic review and a comprehensive taxonomy, the advances of ML and DL-based techniques are carefully investigated, and the promising models are introduced. According to the results obtained for energy demand forecasting, the hybrid and ensemble methods are located in high robustness range, SVM-based methods are located in good robustness limitation, ANN-based methods are located in medium robustness limitation and linear regression models are located in low robustness limitations. On the other hand, for energy consumption forecasting, DL-based, hybrid, and ensemble-based models provided the highest robustness score. ANN, SVM, and single ML models provided good and medium robustness and LR-based models provided the lower robustness score. In addition, for energy load forecasting, LR-based models provided the lower robustness score. The hybrid and ensemble-based models provided a higher robustness score. The DL-based and SVM-based techniques provided a good robustness score and ANN-based techniques provided a medium robustness score.  ( 2 min )
    Integration of neural network and fuzzy logic decision making compared with bilayered neural network in the simulation of daily dew point temperature. (arXiv:2202.12256v1 [cs.LG])
    In this research, dew point temperature (DPT) is simulated using the data-driven approach. Adaptive Neuro-Fuzzy Inference System (ANFIS) is utilized as a data-driven technique to forecast this parameter at Tabriz in East Azerbaijan. Various input patterns, namely T min, T max, and T mean, are utilized for training the architecture whilst DPT is the model's output. The findings indicate that, in general, ANFIS method is capable of identifying data patterns with a high degree of accuracy. However, the approach demonstrates that processing time and computer resources may substantially increase by adding additional functions. Based on the results, the number of iterations and computing resources might change dramatically if new functionalities are included. As a result, tuning parameters have to be optimized inside the method framework. The findings demonstrate a high agreement between results by the data-driven technique (machine learning method) and the observed data. Using this prediction toolkit, DPT can be adequately forecasted solely based on the temperature distribution of Tabriz. This kind of modeling is extremely promising for predicting DPT at various sites. Besides, this study thoroughly compares the Bilayered Neural Network (BNN) and ANFIS models on various scales. Whilst the ANFIS model is extremely stable for almost all numbers of membership functions, the BNN model is highly sensitive to this scale factor to predict DPT.  ( 2 min )
    SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. (arXiv:2110.07205v2 [eess.AS] UPDATED)
    Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We will release our code and model at https://github.com/microsoft/SpeechT5.  ( 2 min )
    Identifying causal relations in tweets using deep learning: Use case on diabetes-related tweets from 2017-2021. (arXiv:2111.01225v4 [cs.CL] UPDATED)
    Objective: Leveraging machine learning methods, we aim to extract both explicit and implicit cause-effect associations in patient-reported, diabetes-related tweets and provide a tool to better understand opinion, feelings and observations shared within the diabetes online community from a causality perspective. Materials and Methods: More than 30 million diabetes-related tweets in English were collected between April 2017 and January 2021. Deep learning and natural language processing methods were applied to focus on tweets with personal and emotional content. A cause-effect-tweet dataset was manually labeled and used to train 1) a fine-tuned Bertweet model to detect causal sentences containing a causal association 2) a CRF model with BERT based features to extract possible cause-effect associations. Causes and effects were clustered in a semi-supervised approach and visualised in an interactive cause-effect-network. Results: Causal sentences were detected with a recall of 68% in an imbalanced dataset. A CRF model with BERT based features outperformed a fine-tuned BERT model for cause-effect detection with a macro recall of 68%. This led to 96,676 sentences with cause-effect associations. "Diabetes" was identified as the central cluster followed by "Death" and "Insulin". Insulin pricing related causes were frequently associated with "Death". Conclusions: A novel methodology was developed to detect causal sentences and identify both explicit and implicit, single and multi-word cause and corresponding effect as expressed in diabetes-related tweets leveraging BERT-based architectures and visualised as cause-effect-network. Extracting causal associations on real-life, patient reported outcomes in social media data provides a useful complementary source of information in diabetes research.  ( 3 min )
    Simultaneous Missing Value Imputation and Structure Learning with Groups. (arXiv:2110.08223v2 [cs.LG] UPDATED)
    Learning structures between groups of variables from data with missing values is an important task in the real world, yet difficult to solve. One typical scenario is discovering the structure among topics in the education domain to identify learning pathways. Here, the observations are student performances for questions under each topic which contain missing values. However, most existing methods focus on learning structures between a few individual variables from the complete data. In this work, we propose VISL, a novel scalable structure learning approach that can simultaneously infer structures between groups of variables under missing data and perform missing value imputations with deep learning. Particularly, we propose a generative model with a structured latent space and a graph neural network-based architecture, scaling to a large number of variables. Empirically, we conduct extensive experiments on synthetic, semi-synthetic, and real-world education data sets. We show improved performances on both imputation and structure learning accuracy compared to popular and recent approaches.  ( 2 min )
    In a Nutshell, the Human Asked for This: Latent Goals for Following Temporal Specifications. (arXiv:2110.09461v2 [cs.AI] UPDATED)
    We address the problem of building agents whose goal is to learn to execute out-of distribution (OOD) multi-task instructions expressed in temporal logic (TL) by using deep reinforcement learning (DRL). Recent works provided evidence that the agent's neural architecture is a key feature when DRL agents are learning to solve OOD tasks in TL. Yet, the studies on this topic are still in their infancy. In this work, we propose a new deep learning configuration with inductive biases that lead agents to generate latent representations of their current goal, yielding a stronger generalization performance. We use these latent-goal networks within a neuro-symbolic framework that executes multi-task formally-defined instructions and contrast the performance of the proposed neural networks against employing different state-of-the-art (SOTA) architectures when generalizing to unseen instructions in OOD environments.  ( 2 min )
    Multivariate Deep Evidential Regression. (arXiv:2104.06135v4 [cs.LG] UPDATED)
    There is significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, yet several important gaps in the theory and implementation of these networks remain. We discuss three issues with a proposed solution to extract aleatoric and epistemic uncertainties from regression-based neural networks. The approach derives a technique by placing evidential priors over the original Gaussian likelihood function and training the NN to infer the hyperparameters of the evidential distribution. Doing so allows for the simultaneous extraction of both uncertainties without sampling or utilization of out-of-distribution data for univariate regression tasks. We describe the outstanding issues in detail, provide a possible solution, and generalize the deep evidential regression technique for multivariate cases.  ( 2 min )
    On Multimarginal Partial Optimal Transport: Equivalent Forms and Computational Complexity. (arXiv:2108.07992v2 [stat.ML] UPDATED)
    We study the multi-marginal partial optimal transport (POT) problem between $m$ discrete (unbalanced) measures with at most $n$ supports. We first prove that we can obtain two equivalence forms of the multimarginal POT problem in terms of the multimarginal optimal transport problem via novel extensions of cost tensor. The first equivalence form is derived under the assumptions that the total masses of each measure are sufficiently close while the second equivalence form does not require any conditions on these masses but at the price of more sophisticated extended cost tensor. Our proof techniques for obtaining these equivalence forms rely on novel procedures of moving mass in graph theory to push transportation plan into appropriate regions. Finally, based on the equivalence forms, we develop optimization algorithm, named ApproxMPOT algorithm, that builds upon the Sinkhorn algorithm for solving the entropic regularized multimarginal optimal transport. We demonstrate that the ApproxMPOT algorithm can approximate the optimal value of multimarginal POT problem with a computational complexity upper bound of the order $\tilde{\mathcal{O}}(m^3(n+1)^{m}/ \varepsilon^2)$ where $\varepsilon > 0$ stands for the desired tolerance.  ( 2 min )
    High-quality Thermal Gibbs Sampling with Quantum Annealing Hardware. (arXiv:2109.01690v2 [quant-ph] UPDATED)
    Quantum Annealing (QA) was originally intended for accelerating the solution of combinatorial optimization tasks that have natural encodings as Ising models. However, recent experiments on QA hardware platforms have demonstrated that, in the operating regime corresponding to weak interactions, the QA hardware behaves like a noisy Gibbs sampler at a hardware-specific effective temperature. This work builds on those insights and identifies a class of small hardware-native Ising models that are robust to noise effects and proposes a procedure for executing these models on QA hardware to maximize Gibbs sampling performance. Experimental results indicate that the proposed protocol results in high-quality Gibbs samples from a hardware-specific effective temperature. Furthermore, we show that this effective temperature can be adjusted by modulating the annealing time and energy scale. The procedure proposed in this work provides an approach to using QA hardware for Ising model sampling presenting potential new opportunities for applications in machine learning and physics simulation.  ( 2 min )
    Covariance-Generalized Matching Component Analysis for Data Fusion and Transfer Learning. (arXiv:2110.13194v2 [cs.LG] UPDATED)
    In order to allow for the encoding of additional statistical information in data fusion and transfer learning applications, we introduce a generalized covariance constraint for the matching component analysis (MCA) transfer learning technique. After proving a semi-orthogonally constrained trace maximization lemma, we develop a closed-form solution to the resulting covariance-generalized optimization problem and provide an algorithm for its computation. We call this technique -- applicable to both data fusion and transfer learning -- covariance-generalized MCA (CGMCA).  ( 2 min )
    Robust Deep Learning from Crowds with Belief Propagation. (arXiv:2111.00734v2 [cs.LG] UPDATED)
    Crowdsourcing systems enable us to collect large-scale dataset, but inherently suffer from noisy labels of low-paid workers. We address the inference and learning problems using such a crowdsourced dataset with noise. Due to the nature of sparsity in crowdsourcing, it is critical to exploit both probabilistic model to capture worker prior and neural network to extract task feature despite risks from wrong prior and overfitted feature in practice. We hence establish a neural-powered Bayesian framework, from which we devise deepMF and deepBP with different choice of variational approximation methods, mean field (MF) and belief propagation (BP), respectively. This provides a unified view of existing methods, which are special cases of deepMF with different priors. In addition, our empirical study suggests that deepBP is a new approach, which is more robust against wrong prior, feature overfitting and extreme workers thanks to the more sophisticated BP than MF.  ( 2 min )
    Equivariant Deep Dynamical Model for Motion Prediction. (arXiv:2111.01892v2 [cs.LG] UPDATED)
    Learning representations through deep generative modeling is a powerful approach for dynamical modeling to discover the most simplified and compressed underlying description of the data, to then use it for other tasks such as prediction. Most learning tasks have intrinsic symmetries, i.e., the input transformations leave the output unchanged, or the output undergoes a similar transformation. The learning process is, however, usually uninformed of these symmetries. Therefore, the learned representations for individually transformed inputs may not be meaningfully related. In this paper, we propose an SO(3) equivariant deep dynamical model (EqDDM) for motion prediction that learns a structured representation of the input space in the sense that the embedding varies with symmetry transformations. EqDDM is equipped with equivariant networks to parameterize the state-space emission and transition models. We demonstrate the superior predictive performance of the proposed model on various motion data.  ( 2 min )
    BioLCNet: Reward-modulated Locally Connected Spiking Neural Networks. (arXiv:2109.05539v4 [cs.NE] UPDATED)
    Brain-inspired computation and information processing alongside compatibility with neuromorphic hardware have made spiking neural networks (SNN) a promising method for solving learning tasks in machine learning (ML). However, the mere use of spiking neurons is not sufficient for building models that mimic the learning mechanisms in the biological brain; network architecture and learning rules are important factors to consider when developing such artificial agents. In this work, inspired by the human visual pathway and the role of dopamine in learning, we propose a reward-modulated locally connected spiking neural network, BioLCNet, for visual learning tasks. To extract visual features from Poisson-distributed spike trains, we used local filters that are more analogous to the biological visual system compared to convolutional filters with weight sharing. In the decoding layer, we applied a spike population-based voting scheme to determine the decision of the network. We employed Spike-timing-dependent plasticity (STDP) for learning the visual features, and its reward-modulated variant (R-STDP) for training the decoder based on the reward or punishment feedback signal. For evaluation, we first assessed the robustness of our rewarding mechanism to varying target responses in a classical conditioning experiment. Afterwards, we evaluated the performance of our network on image classification tasks of MNIST and XOR MNIST datasets.  ( 2 min )
    Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error. (arXiv:2105.13343v2 [cs.LG] UPDATED)
    In computer vision, it is standard practice to draw a single sample from the data augmentation procedure for each unique image in the mini-batch. However recent work has suggested drawing multiple samples can achieve higher test accuracies. In this work, we provide a detailed empirical evaluation of how the number of augmentation samples per unique image influences model performance on held out data when training deep ResNets. We demonstrate drawing multiple samples per image consistently enhances the test accuracy achieved for both small and large batch training. Crucially, this benefit arises even if different numbers of augmentations per image perform the same number of parameter updates and gradient evaluations (requiring the same total compute). Although prior work has found variance in the gradient estimate arising from subsampling the dataset has an implicit regularization benefit, our experiments suggest variance which arises from the data augmentation process harms generalization. We apply these insights to the highly performant NFNet-F5, achieving 86.8$\%$ top-1 w/o extra data on ImageNet.  ( 2 min )
    Reconfigurable Intelligent Surfaces in Action for Non-Terrestrial Networks. (arXiv:2012.00968v4 [cs.IT] UPDATED)
    Next-generation communication technology will be made possible by cooperation between terrestrial networks with non-terrestrial networks (NTN) comprised of high-altitude platform stations and satellites. Further, as humanity embarks on the long road to establish new habitats on other planets, cooperation between NTN and deep-space networks (DSN) will be necessary. In this regard, we propose the use of reconfigurable intelligent surfaces (RIS) to improve coordination between these networks given that RIS perfectly match the size, weight, and power restrictions of operating in space. A comprehensive framework of RIS-assisted non-terrestrial and interplanetary communications is presented that pinpoints challenges, use cases, and open issues. Furthermore, the performance of RIS-assisted NTN under environmental effects such as solar scintillation and satellite drag is discussed in light of simulation results.  ( 2 min )
    Double Control Variates for Gradient Estimation in Discrete Latent Variable Models. (arXiv:2111.05300v2 [stat.ML] UPDATED)
    Stochastic gradient-based optimisation for discrete latent variable models is challenging due to the high variance of gradients. We introduce a variance reduction technique for score function estimators that makes use of double control variates. These control variates act on top of a main control variate, and try to further reduce the variance of the overall estimator. We develop a double control variate for the REINFORCE leave-one-out estimator using Taylor expansions. For training discrete latent variable models, such as variational autoencoders with binary latent variables, our approach adds no extra computational cost compared to standard training with the REINFORCE leave-one-out estimator. We apply our method to challenging high-dimensional toy examples and training variational autoencoders with binary latent variables. We show that our estimator can have lower variance compared to other state-of-the-art estimators.  ( 2 min )
    Sobolev Norm Learning Rates for Conditional Mean Embeddings. (arXiv:2105.07446v3 [stat.ML] UPDATED)
    We develop novel learning rates for conditional mean embeddings by applying the theory of interpolation for reproducing kernel Hilbert spaces (RKHS). We derive explicit, adaptive convergence rates for the sample estimator under the misspecifed setting, where the target operator is not Hilbert-Schmidt or bounded with respect to the input/output RKHSs. We demonstrate that in certain parameter regimes, we can achieve uniform convergence rates in the output RKHS. We hope our analyses will allow the much broader application of conditional mean embeddings to more complex ML/RL settings involving infinite dimensional RKHSs and continuous state spaces.  ( 2 min )
    A Perceptual Measure for Evaluating the Resynthesis of Automatic Music Transcriptions. (arXiv:2202.12257v1 [cs.SD])
    This study focuses on the perception of music performances when contextual factors, such as room acoustics and instrument, change. We propose to distinguish the concept of "performance" from the one of "interpretation", which expresses the "artistic intention". Towards assessing this distinction, we carried out an experimental evaluation where 91 subjects were invited to listen to various audio recordings created by resynthesizing MIDI data obtained through Automatic Music Transcription (AMT) systems and a sensorized acoustic piano. During the resynthesis, we simulated different contexts and asked listeners to evaluate how much the interpretation changes when the context changes. Results show that: (1) MIDI format alone is not able to completely grasp the artistic intention of a music performance; (2) usual objective evaluation measures based on MIDI data present low correlations with the average subjective evaluation. To bridge this gap, we propose a novel measure which is meaningfully correlated with the outcome of the tests. In addition, we investigate multimodal machine learning by providing a new score-informed AMT method and propose an approximation algorithm for the $p$-dispersion problem.  ( 2 min )
    A Multilayered Block Network Model to Forecast Large Dynamic Transportation Graphs: an Application to US Air Transport. (arXiv:1911.13136v4 [stat.ML] UPDATED)
    Dynamic transportation networks have been analyzed for years by means of static graph-based indicators in order to study the temporal evolution of relevant network components, and to reveal complex dependencies that would not be easily detected by a direct inspection of the data. This paper presents a state-of-the-art probabilistic latent network model to forecast multilayer dynamic graphs that are increasingly common in transportation and proposes a community-based extension to reduce the computational burden. Flexible time series analysis is obtained by modeling the probability of edges between vertices through latent Gaussian processes. The models and Bayesian inference are illustrated on a sample of 10-year data from four major airlines within the US air transportation system. Results show how the estimated latent parameters from the models are related to the airline's connectivity dynamics, and their ability to project the multilayer graph into the future for out-of-sample full network forecasts, while stochastic blockmodeling allows for the identification of relevant communities. Reliable network predictions would allow policy-makers to better understand the dynamics of the transport system, and help in their planning on e.g. route development, or the deployment of new regulations.  ( 2 min )
    A Simple and General Debiased Machine Learning Theorem with Finite Sample Guarantees. (arXiv:2105.15197v2 [stat.ML] UPDATED)
    Debiased machine learning is a meta algorithm based on bias correction and sample splitting to calculate confidence intervals for functionals (i.e. scalar summaries) of machine learning algorithms. For example, an analyst may desire the confidence interval for a treatment effect estimated with a neural network. We provide a nonasymptotic debiased machine learning theorem that encompasses any global or local functional of any machine learning algorithm that satisfies a few simple, interpretable conditions. Formally, we prove consistency, Gaussian approximation, and semiparametric efficiency by finite sample arguments. The rate of convergence is $n^{-1/2}$ for global functionals, and it degrades gracefully for local functionals. Our results culminate in a simple set of conditions that an analyst can use to translate modern learning theory rates into traditional statistical inference. The conditions reveal a general double robustness property for ill posed inverse problems.  ( 2 min )
    Disentanglement via Mechanism Sparsity Regularization: A New Principle for Nonlinear ICA. (arXiv:2107.10098v3 [stat.ML] UPDATED)
    This work introduces a novel principle we call disentanglement via mechanism sparsity regularization, which can be applied when the latent factors of interest depend sparsely on past latent factors and/or observed auxiliary variables. We propose a representation learning method that induces disentanglement by simultaneously learning the latent factors and the sparse causal graphical model that relates them. We develop a rigorous identifiability theory, building on recent nonlinear independent component analysis (ICA) results, that formalizes this principle and shows how the latent variables can be recovered up to permutation if one regularizes the latent mechanisms to be sparse and if some graph connectivity criterion is satisfied by the data generating process. As a special case of our framework, we show how one can leverage unknown-target interventions on the latent factors to disentangle them, thereby drawing further connections between ICA and causality. We propose a VAE-based method in which the latent mechanisms are learned and regularized via binary masks, and validate our theory by showing it learns disentangled representations in simulations.  ( 2 min )
    Effect Identification in Cluster Causal Diagrams. (arXiv:2202.12263v1 [stat.ME])
    One pervasive task found throughout the empirical sciences is to determine the effect of interventions from non-experimental data. It is well-understood that assumptions are necessary to perform causal inferences, which are commonly articulated through causal diagrams (Pearl, 2000). Despite the power of this approach, there are settings where the knowledge necessary to specify a causal diagram over all observed variables may not be available, particularly in complex, high-dimensional domains. In this paper, we introduce a new type of graphical model called cluster causal diagrams (for short, C-DAGs) that allows for the partial specification of relationships among variables based on limited prior knowledge, alleviating the stringent requirement of specifying a full causal diagram. A C-DAG specifies relationships between clusters of variables, while the relationships between the variables within a cluster are left unspecified. We develop the foundations and machinery for valid causal inferences over C-DAGs. In particular, we first define a new version of the d-separation criterion and prove its soundness and completeness. Secondly, we extend these new separation rules and prove the validity of the corresponding do-calculus. Lastly, we show that a standard identification algorithm is sound and complete to systematically compute causal effects from observational data given a C-DAG.  ( 2 min )
    Submodular Maximization in Clean Linear Time. (arXiv:2006.09327v4 [cs.DS] UPDATED)
    In this paper, we provide the first deterministic algorithm that achieves the tight $1-1/e$ approximation guarantee for submodular maximization under a cardinality (size) constraint while making a number of queries that scales only linearly with the size of the ground set $n$. To complement our result, we also show strong information-theoretic lower bounds. More specifically, we show that when the maximum cardinality allowed for a solution is constant, no algorithm making a sub-linear number of function evaluations can guarantee any constant approximation ratio. Furthermore, when the constraint allows the selection of a constant fraction of the ground set, we show that any algorithm making fewer than $\Omega(n/\log(n))$ function evaluations cannot perform better than an algorithm that simply outputs a uniformly random subset of the ground set of the right size. We then provide a variant of our deterministic algorithm for the more general knapsack constraint, which is the first linear-time algorithm that achieves $1/2$-approximation guarantee for this constraint. Finally, we extend our results to the general case of maximizing a monotone submodular function subject to the intersection of a $p$-set system and multiple knapsack constraints. We extensively evaluate the performance of our algorithms on multiple real-life machine learning applications, including movie recommendation, location summarization, twitter text summarization and video summarization.  ( 2 min )
    Quadric Hypersurface Intersection for Manifold Learning in Feature Space. (arXiv:2102.06186v2 [cs.LG] UPDATED)
    The knowledge that data lies close to a particular submanifold of the ambient Euclidean space may be useful in a number of ways. For instance, one may want to automatically mark any point far away from the submanifold as an outlier or to use the geometry to come up with a better distance metric. Manifold learning problems are often posed in a very high dimension, e.g. for spaces of images or spaces of words. Today, with deep representation learning on the rise in areas such as computer vision and natural language processing, many problems of this kind may be transformed into problems of moderately high dimension, typically of the order of hundreds. Motivated by this, we propose a manifold learning technique suitable for moderately high dimension and large datasets. The manifold is learned from the training data in the form of an intersection of quadric hypersurfaces -- simple but expressive objects. At test time, this manifold can be used to introduce a computationally efficient outlier score for arbitrary new data points and to improve a given similarity metric by incorporating the learned geometric structure into it.  ( 2 min )
    Factorizer: A Scalable Interpretable Approach to Context Modeling for Medical Image Segmentation. (arXiv:2202.12295v1 [eess.IV])
    Convolutional Neural Networks (CNNs) with U-shaped architectures have dominated medical image segmentation, which is crucial for various clinical purposes. However, the inherent locality of convolution makes CNNs fail to fully exploit global context, essential for better recognition of some structures, e.g., brain lesions. Transformers have recently proved promising performance on vision tasks, including semantic segmentation, mainly due to their capability of modeling long-range dependencies. Nevertheless, the quadratic complexity of attention makes existing Transformer-based models use self-attention layers only after somehow reducing the image resolution, which limits the ability to capture global contexts present at higher resolutions. Therefore, this work introduces a family of models, dubbed Factorizer, which leverages the power of low-rank matrix factorization for constructing an end-to-end segmentation model. Specifically, we propose a linearly scalable approach to context modeling, formulating Nonnegative Matrix Factorization (NMF) as a differentiable layer integrated into a U-shaped architecture. The shifted window technique is also utilized in combination with NMF to effectively aggregate local information. Factorizers compete favorably with CNNs and Transformers in terms of accuracy, scalability, and interpretability, achieving state-of-the-art results on the BraTS dataset for brain tumor segmentation, with Dice scores of 79.33%, 83.14%, and 90.16% for enhancing tumor, tumor core, and whole tumor, respectively. Highly meaningful NMF components give an additional interpretability advantage to Factorizers over CNNs and Transformers. Moreover, our ablation studies reveal a distinctive feature of Factorizers that enables a significant speed-up in inference for a trained Factorizer without any extra steps and without sacrificing much accuracy.  ( 2 min )
    Policy design in experiments with unknown interference. (arXiv:2011.08174v6 [econ.EM] UPDATED)
    This paper proposes an experimental design for estimation and inference on welfare-maximizing policies in the presence of spillover effects. I consider a setting where units are organized into a finite number of large clusters and interact in unknown ways within each cluster. As a first contribution, I introduce a single-wave experiment which, by carefully varying the randomization across pairs of clusters, estimates the marginal effect of a change in treatment probabilities, taking spillover effects into account. Using the marginal effect, I propose a practical test for policy optimality. The idea is that researchers should report the marginal effect and test for policy optimality: the marginal effect indicates the direction for a welfare improvement, and the test provides evidence on whether it is worth conducting additional experiments to estimate a welfare-improving treatment allocation. As a second contribution, I design a multiple-wave experiment to estimate treatment assignment rules and maximize welfare. I derive strong small-sample guarantees on the difference between the maximum attainable welfare and the welfare evaluated at the estimated policy, which converges linearly in the number of clusters and iterations. Simulations calibrated to existing experiments on information diffusion and cash-transfer programs show welfare improvements up to fifty percentage points.  ( 2 min )
    On the Omnipresence of Spurious Local Minima in Certain Neural Network Training Problems. (arXiv:2202.12262v1 [cs.LG])
    We study the loss landscape of training problems for deep artificial neural networks with a one-dimensional real output whose activation functions contain an affine segment and whose hidden layers have width at least two. It is shown that such problems possess a continuum of spurious (i.e., not globally optimal) local minima for all target functions that are not affine. In contrast to previous works, our analysis covers all sampling and parameterization regimes, general differentiable loss functions, arbitrary continuous nonpolynomial activation functions, and both the finite- and infinite-dimensional setting. It is further shown that the appearance of the spurious local minima in the considered training problems is a direct consequence of the universal approximation theorem and that the underlying mechanisms also cause, e.g., Lp-best approximation problems to be ill-posed in the sense of Hadamard for all networks that do not have a dense image. The latter result also holds without the assumption of local affine linearity and without any conditions on the hidden layers. The paper concludes with a numerical experiment which demonstrates that spurious local minima can indeed affect the convergence behavior of gradient-based solution algorithms in practice.  ( 2 min )
    Measuring CLEVRness: Blackbox testing of Visual Reasoning Models. (arXiv:2202.12162v1 [cs.LG])
    How can we measure the reasoning capabilities of intelligence systems? Visual question answering provides a convenient framework for testing the model's abilities by interrogating the model through questions about the scene. However, despite scores of various visual QA datasets and architectures, which sometimes yield even a super-human performance, the question of whether those architectures can actually reason remains open to debate. To answer this, we extend the visual question answering framework and propose the following behavioral test in the form of a two-player game. We consider black-box neural models of CLEVR. These models are trained on a diagnostic dataset benchmarking reasoning. Next, we train an adversarial player that re-configures the scene to fool the CLEVR model. We show that CLEVR models, which otherwise could perform at a human level, can easily be fooled by our agent. Our results put in doubt whether data-driven approaches can do reasoning without exploiting the numerous biases that are often present in those datasets. Finally, we also propose a controlled experiment measuring the efficiency of such models to learn and perform reasoning.  ( 2 min )
    Partitioned Variational Inference: A framework for probabilistic federated learning. (arXiv:2202.12275v1 [stat.ML])
    The proliferation of computing devices has brought about an opportunity to deploy machine learning models on new problem domains using previously inaccessible data. Traditional algorithms for training such models often require data to be stored on a single machine with compute performed by a single node, making them unsuitable for decentralised training on multiple devices. This deficiency has motivated the development of federated learning algorithms, which allow multiple data owners to train collaboratively and use a shared model whilst keeping local data private. However, many of these algorithms focus on obtaining point estimates of model parameters, rather than probabilistic estimates capable of capturing model uncertainty, which is essential in many applications. Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. In this paper we introduce partitioned variational inference (PVI), a general framework for performing VI in the federated setting. We develop new supporting theory for PVI, demonstrating a number of properties that make it an attractive choice for practitioners; use PVI to unify a wealth of fragmented, yet related literature; and provide empirical results that showcase the effectiveness of PVI in a variety of federated settings.  ( 2 min )
    Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models. (arXiv:2010.13933v4 [stat.ML] UPDATED)
    The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities which strike a balance between bias and variance. Modern Deep Learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized models" where the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in two minimal models of over-parameterization (linear regression and two-layer neural networks with nonlinear data distributions), allowing us to disentangle properties stemming from the model architecture and random sampling of data. In both models, increasing the number of fit parameters leads to a phase transition where the training error goes to zero and the test error diverges as a result of the variance (while the bias remains finite). Beyond this threshold, the test error of the two-layer neural network decreases due to a monotonic decrease in \emph{both} the bias and variance in contrast with the classical bias-variance trade-off. We also show that in contrast with classical intuition, over-parameterized models can overfit even in the absence of noise and exhibit bias even if the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.  ( 3 min )
    On the influence of roundoff errors on the convergence of the gradient descent method with low-precision floating-point computation. (arXiv:2202.12276v1 [cs.LG])
    The employment of stochastic rounding schemes helps prevent stagnation of convergence, due to vanishing gradient effect when implementing the gradient descent method in low precision. Conventional stochastic rounding achieves zero bias by preserving small updates with probabilities proportional to their relative magnitudes. In this study, we propose a new stochastic rounding scheme that trades the zero bias property with a larger probability to preserve small gradients. Our method yields a constant rounding bias that, at each iteration, lies in a descent direction. For convex problems, we prove that the proposed rounding method has a beneficial effect on the convergence rate of gradient descent. We validate our theoretical analysis by comparing the performances of various rounding schemes when optimizing a multinomial logistic regression model and when training a simple neural network with 8-bit floating-point format.  ( 2 min )
    Evaluating Feature Attribution Methods in the Image Domain. (arXiv:2202.12270v1 [cs.CV])
    Feature attribution maps are a popular approach to highlight the most important pixels in an image for a given prediction of a model. Despite a recent growth in popularity and available methods, little attention is given to the objective evaluation of such attribution maps. Building on previous work in this domain, we investigate existing metrics and propose new variants of metrics for the evaluation of attribution maps. We confirm a recent finding that different attribution metrics seem to measure different underlying concepts of attribution maps, and extend this finding to a larger selection of attribution metrics. We also find that metric results on one dataset do not necessarily generalize to other datasets, and methods with desirable theoretical properties such as DeepSHAP do not necessarily outperform computationally cheaper alternatives. Based on these findings, we propose a general benchmarking approach to identify the ideal feature attribution method for a given use case. Implementations of attribution metrics and our experiments are available online.  ( 2 min )
    Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization. (arXiv:2111.00705v2 [cs.LG] UPDATED)
    Due to the explosion in the size of the training datasets, distributed learning has received growing interest in recent years. One of the major bottlenecks is the large communication cost between the central server and the local workers. While error feedback compression has been proven to be successful in reducing communication costs with stochastic gradient descent (SGD), there are much fewer attempts in building communication-efficient adaptive gradient methods with provable guarantees, which are widely used in training large-scale machine learning models. In this paper, we propose a new communication-compressed AMSGrad for distributed nonconvex optimization problem, which is provably efficient. Our proposed distributed learning framework features an effective gradient compression strategy and a worker-side model update design. We prove that the proposed communication-efficient distributed adaptive gradient method converges to the first-order stationary point with the same iteration complexity as uncompressed vanilla AMSGrad in the stochastic nonconvex optimization setting. Experiments on various benchmarks back up our theory.  ( 2 min )
    Analyzing Micro-Founded General Equilibrium Models with Many Agents using Deep Reinforcement Learning. (arXiv:2201.01163v2 [cs.GT] UPDATED)
    Real economies can be modeled as a sequential imperfect-information game with many heterogeneous agents, such as consumers, firms, and governments. Dynamic general equilibrium (DGE) models are often used for macroeconomic analysis in this setting. However, finding general equilibria is challenging using existing theoretical or computational methods, especially when using microfoundations to model individual agents. Here, we show how to use deep multi-agent reinforcement learning (MARL) to find $\epsilon$-meta-equilibria over agent types in microfounded DGE models. Whereas standard MARL fails to learn non-trivial solutions, our structured learning curricula enable stable convergence to meaningful solutions. Conceptually, our approach is more flexible and does not need unrealistic assumptions, e.g., continuous market clearing, that are commonly used for analytical tractability. Furthermore, our end-to-end GPU implementation enables fast real-time convergence with a large number of RL economic agents. We showcase our approach in open and closed real-business-cycle (RBC) models with 100 worker-consumers, 10 firms, and a social planner who taxes and redistributes. We validate the learned solutions are $\epsilon$-meta-equilibria through best-response analyses, show that they align with economic intuitions, and show our approach can learn a spectrum of qualitatively distinct $\epsilon$-meta-equilibria in open RBC models. As such, we show that hardware-accelerated MARL is a promising framework for modeling the complexity of economies based on microfoundations.  ( 2 min )
    Reinforcement Learning for Personalized Drug Discovery and Design for Complex Diseases: A Systems Pharmacology Perspective. (arXiv:2201.08894v3 [q-bio.BM] UPDATED)
    Many multi-genic systemic diseases such as neurological disorders, inflammatory diseases, and the majority of cancers do not have effective treatments yet. Reinforcement learning powered systems pharmacology is a potentially effective approach to design personalized therapies for untreatable complex diseases. In this survey, state-of-the-art reinforcement learning methods and their latest applications to drug design are reviewed. The challenges on harnessing reinforcement learning for systems pharmacology and personalized medicine are discussed. Potential solutions to overcome the challenges are proposed. In spite of successful application of advanced reinforcement learning techniques to target-based drug discovery, new reinforcement learning strategies are needed to address systems pharmacology-oriented personalized de novo drug design.  ( 2 min )
    Multiscale Generative Models: Improving Performance of a Generative Model Using Feedback from Other Dependent Generative Models. (arXiv:2201.09644v2 [cs.LG] UPDATED)
    Realistic fine-grained multi-agent simulation of real-world complex systems is crucial for many downstream tasks such as reinforcement learning. Recent work has used generative models (GANs in particular) for providing high-fidelity simulation of real-world systems. However, such generative models are often monolithic and miss out on modeling the interaction in multi-agent systems. In this work, we take a first step towards building multiple interacting generative models (GANs) that reflects the interaction in real world. We build and analyze a hierarchical set-up where a higher-level GAN is conditioned on the output of multiple lower-level GANs. We present a technique of using feedback from the higher-level GAN to improve performance of lower-level GANs. We mathematically characterize the conditions under which our technique is impactful, including understanding the transfer learning nature of our set-up. We present three distinct experiments on synthetic data, time series data, and image domain, revealing the wide applicability of our technique.  ( 2 min )
    Block Policy Mirror Descent. (arXiv:2201.05756v2 [cs.LG] UPDATED)
    In this paper, we present a new class of policy gradient (PG) methods, namely the block policy mirror descent (BPMD) methods for solving a class of regularized reinforcement learning (RL) problems with (strongly) convex regularizers. Compared to the traditional PG methods with a batch update rule, which visits and updates the policy for every state, BPMD methods have cheap per-iteration computation via a partial update rule that performs the policy update on a sampled state. Despite the nonconvex nature of the problem and a partial update rule, BPMD methods achieve fast linear convergence to the global optimality. We further extend BPMD methods to the stochastic setting, by utilizing stochastic first-order information constructed from samples. We establish $\mathcal{O}(1/\epsilon)$ (resp. $\mathcal{O}(1/\epsilon^2)$) sample complexity for the strongly convex (resp. non-strongly convex) regularizers, where $\epsilon$ denotes the target accuracy, with different procedures for constructing the stochastic first-order information. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning, which provides a new perspective on solving large-scale RL problems.  ( 2 min )
    Self-Supervised Radio-Visual Representation Learning for 6G Sensing. (arXiv:2111.02887v2 [cs.NI] UPDATED)
    In future 6G cellular networks, a joint communication and sensing protocol will allow the network to perceive the environment, opening the door for many new applications atop a unified communication-perception infrastructure. However, interpreting the sparse radio representation of sensing scenes is challenging, which hinders the potential of these emergent systems. We propose to combine radio and vision to automatically learn a radio-only sensing model with minimal human intervention. We want to build a radio sensing model that can feed on millions of uncurated data points. To this end, we leverage recent advances in self-supervised learning and formulate a new label-free radio-visual co-learning scheme, whereby vision trains radio via cross-modal mutual information. We implement and evaluate our scheme according to the common linear classification benchmark, and report qualitative and quantitative performance metrics. In our evaluation, the representation learnt by radio-visual self-supervision works well for a downstream sensing demonstrator, and outperforms its fully-supervised counterpart when less labelled data is used. This indicates that self-supervised learning could be an important enabler for future scalable radio sensing systems.  ( 2 min )
    Neighborhood Region Smoothing Regularization for Finding Flat Minima In Deep Neural Networks. (arXiv:2201.06064v2 [cs.LG] UPDATED)
    Due to diverse architectures in deep neural networks (DNNs) with severe overparameterization, regularization techniques are critical for finding optimal solutions in the huge hypothesis space. In this paper, we propose an effective regularization technique, called Neighborhood Region Smoothing (NRS). NRS leverages the finding that models would benefit from converging to flat minima, and tries to regularize the neighborhood region in weight space to yield approximate outputs. Specifically, gap between outputs of models in the neighborhood region is gauged by a defined metric based on Kullback-Leibler divergence. This metric provides similar insights with the minimum description length principle on interpreting flat minima. By minimizing both this divergence and empirical loss, NRS could explicitly drive the optimizer towards converging to flat minima. We confirm the effectiveness of NRS by performing image classification tasks across a wide range of model architectures on commonly-used datasets such as CIFAR and ImageNet, where generalization ability could be universally improved. Also, we empirically show that the minima found by NRS would have relatively smaller Hessian eigenvalues compared to the conventional method, which is considered as the evidence of flat minima.  ( 2 min )
    Conditionally Gaussian PAC-Bayes. (arXiv:2110.11886v2 [cs.LG] UPDATED)
    Recent studies have empirically investigated different methods to train stochastic neural networks on a classification task by optimising a PAC-Bayesian bound via stochastic gradient descent. Most of these procedures need to replace the misclassification error with a surrogate loss, leading to a mismatch between the optimisation objective and the actual generalisation bound. The present paper proposes a novel training algorithm that optimises the PAC-Bayesian bound, without relying on any surrogate loss. Empirical results show that this approach outperforms currently available PAC-Bayesian training methods.  ( 2 min )
    Universal Efficient Variable-rate Neural Image Compression. (arXiv:2111.11305v4 [eess.IV] UPDATED)
    Recently, Learning-based image compression has reached comparable performance with traditional image codecs(such as JPEG, BPG, WebP). However, computational complexity and rate flexibility are still two major challenges for its practical deployment. To tackle these problems, this paper proposes two universal modules named Energy-based Channel Gating(ECG) and Bit-rate Modulator(BM), which can be directly embedded into existing end-to-end image compression models. ECG uses dynamic pruning to reduce FLOPs for more than 50\% in convolution layers, and a BM pair can modulate the latent representation to control the bit-rate in a channel-wise manner. By implementing these two modules, existing learning-based image codecs can obtain ability to output arbitrary bit-rate with a single model and reduced computation.  ( 2 min )
    An Alternate Policy Gradient Estimator for Softmax Policies. (arXiv:2112.11622v2 [cs.LG] UPDATED)
    Policy gradient (PG) estimators are ineffective in dealing with softmax policies that are sub-optimally saturated, which refers to the situation when the policy concentrates its probability mass on sub-optimal actions. Sub-optimal policy saturation may arise from bad policy initialization or sudden changes in the environment that occur after the policy has already converged. Current softmax PG estimators require a large number of updates to overcome policy saturation, which causes low sample efficiency and poor adaptability to new situations. To mitigate this problem, we propose a novel PG estimator for softmax policies that utilizes the bias in the critic estimate and the noise present in the reward signal to escape the saturated regions of the policy parameter space. Our theoretical analysis and experiments, conducted on bandits and various reinforcement learning environments, show that this new estimator is significantly more robust to policy saturation.  ( 2 min )
    TENGraD: Time-Efficient Natural Gradient Descent with Exact Fisher-Block Inversion. (arXiv:2106.03947v3 [cs.LG] UPDATED)
    This work proposes a time-efficient Natural Gradient Descent method, called TENGraD, with linear convergence guarantees. Computing the inverse of the neural network's Fisher information matrix is expensive in NGD because the Fisher matrix is large. Approximate NGD methods such as KFAC attempt to improve NGD's running time and practical application by reducing the Fisher matrix inversion cost with approximation. However, the approximations do not reduce the overall time significantly and lead to less accurate parameter updates and loss of curvature information. TENGraD improves the time efficiency of NGD by computing Fisher block inverses with a computationally efficient covariance factorization and reuse method. It computes the inverse of each block exactly using the Woodbury matrix identity to preserve curvature information while admitting (linear) fast convergence rates. Our experiments on image classification tasks for state-of-the-art deep neural architecture on CIFAR-10, CIFAR-100, and Fashion-MNIST show that TENGraD significantly outperforms state-of-the-art NGD methods and often stochastic gradient descent in wall-clock time.  ( 2 min )
    Cognitive Explainers of Graph Neural Networks Based on Medical Concepts. (arXiv:2201.07798v2 [cs.LG] UPDATED)
    Although deep neural networks (DNN) have achieved state-of-the-art performance in various fields, some unexpected errors are often found in the neural network, which is very dangerous for some tasks requiring high reliability and high security. The non-transparency and unexplainably of Convolutional Neural Networks (CNN) still limit its application in many fields, such as medical care and finance. Despite current studies that have been committed to visualizing the decision process of DNN, most of these methods focus on the low level and do not take into account the prior knowledge of medicine. In this work, we propose an interpretable framework based on key medical concepts, enabling CNN to explain from the perspective of doctors' cognition. We propose an interpretable automatic recognition framework for the ultrasonic standard plane, which uses a concept-based graph convolutional neural network to construct the relationships between key medical concepts, to obtain an interpretation consistent with a doctor's cognition. Extensive experiments have empirically shown that our model can meaningfully explain the decision of the classifier and provide quantitative support.  ( 2 min )
    Capturing Failures of Large Language Models via Human Cognitive Biases. (arXiv:2202.12299v1 [cs.CL])
    Large language models generate complex, open-ended outputs: instead of outputting a single class, they can write summaries, generate dialogue, and produce working code. In order to study the reliability of these open-ended systems, we must understand not just when they fail, but also how they fail. To approach this, we draw inspiration from human cognitive biases -- systematic patterns of deviation from rational judgement. Specifically, we use cognitive biases to (i) identify inputs that models are likely to err on, and (ii) develop tests to qualitatively characterize their errors on these inputs. Using code generation as a case study, we find that OpenAI's Codex errs predictably based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples. We then use our framework to uncover high-impact errors such as incorrectly deleting files. Our experiments suggest that cognitive science can be a useful jumping-off point to better understand how contemporary machine learning systems behave.  ( 2 min )
    Fast and Scalable Spike and Slab Variable Selection in High-Dimensional Gaussian Processes. (arXiv:2111.04558v2 [stat.ML] UPDATED)
    Variable selection in Gaussian processes (GPs) is typically undertaken by thresholding the inverse lengthscales of automatic relevance determination kernels, but in high-dimensional datasets this approach can be unreliable. A more probabilistically principled alternative is to use spike and slab priors and infer a posterior probability of variable inclusion. However, existing implementations in GPs are very costly to run in both high-dimensional and large-$n$ datasets, or are only suitable for unsupervised settings with specific kernels. As such, we develop a fast and scalable variational inference algorithm for the spike and slab GP that is tractable with arbitrary differentiable kernels. We improve our algorithm's ability to adapt to the sparsity of relevant variables by Bayesian model averaging over hyperparameters, and achieve substantial speed ups using zero temperature posterior restrictions, dropout pruning and nearest neighbour minibatching. In experiments our method consistently outperforms vanilla and sparse variational GPs whilst retaining similar runtimes (even when $n=10^6$) and performs competitively with a spike and slab GP using MCMC but runs up to $1000$ times faster.  ( 2 min )
    Resampling Base Distributions of Normalizing Flows. (arXiv:2110.15828v2 [stat.ML] UPDATED)
    Normalizing flows are a popular class of models for approximating probability distributions. However, their invertible nature limits their ability to model target distributions whose support have a complex topological structure, such as Boltzmann distributions. Several procedures have been proposed to solve this problem but many of them sacrifice invertibility and, thereby, tractability of the log-likelihood as well as other desirable properties. To address these limitations, we introduce a base distribution for normalizing flows based on learned rejection sampling, allowing the resulting normalizing flow to model complicated distributions without giving up bijectivity. Furthermore, we develop suitable learning algorithms using both maximizing the log-likelihood and the optimization of the Kullback-Leibler divergence, and apply them to various sample problems, i.e. approximating 2D densities, density estimation of tabular data, image generation, and modeling Boltzmann distributions. In these experiments our method is competitive with or outperforms the baselines.  ( 2 min )
    Step-unrolled Denoising Autoencoders for Text Generation. (arXiv:2112.06749v2 [cs.CL] UPDATED)
    In this paper we propose a new generative model of text, Step-unrolled Denoising Autoencoder (SUNDAE), that does not rely on autoregressive models. Similarly to denoising diffusion techniques, SUNDAE is repeatedly applied on a sequence of tokens, starting from random inputs and improving them each time until convergence. We present a simple new improvement operator that converges in fewer iterations than diffusion methods, while qualitatively producing better samples on natural language datasets. SUNDAE achieves state-of-the-art results (among non-autoregressive methods) on the WMT'14 English-to-German translation task and good qualitative results on unconditional language modeling on the Colossal Cleaned Common Crawl dataset and a dataset of Python code from GitHub. The non-autoregressive nature of SUNDAE opens up possibilities beyond left-to-right prompted generation, by filling in arbitrary blank patterns in a template.  ( 2 min )
    Serverless Model Serving for Data Science. (arXiv:2103.02958v2 [cs.DC] UPDATED)
    Machine learning (ML) is an important part of modern data science applications. Data scientists today have to manage the end-to-end ML life cycle that includes both model training and model serving, the latter of which is essential, as it makes their works available to end-users. Systems of model serving require high performance, low cost, and ease of management. Cloud providers are already offering model serving choices, including managed services and self-rented servers. Recently, serverless computing, whose advantages include high elasticity and a fine-grained cost model, brings another option for model serving. Our goal in this paper is to examine the viability of serverless as a mainstream model serving platform. To this end, we first conduct a comprehensive evaluation of the performance and cost of serverless against other model serving systems on Amazon Web Service and Google Cloud Platform. We find that serverless outperforms many cloud-based alternatives. Further, there are settings under which it even achieves better performance than GPU-based systems. Next, we present the design space of serverless model serving, which comprises multiple dimensions, including cloud platforms, serving runtimes, and other function-specific parameters. For each dimension, we analyze the impact of different choices and provide suggestions for data scientists to better utilize serverless model serving. Finally, we discuss challenges and opportunities in building a more practical serverless model serving system.  ( 2 min )
    Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm. (arXiv:2102.12238v3 [cs.LG] UPDATED)
    We provide a function space characterization of the inductive bias resulting from minimizing the $\ell_2$ norm of the weights in multi-channel convolutional neural networks with linear activations and empirically test our resulting hypothesis on ReLU networks trained using gradient descent. We define an induced regularizer in the function space as the minimum $\ell_2$ norm of weights of a network required to realize a function. For two layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network are single channeled, the induced regularizer for any $K$ is independent of the number of output channels $C$. Furthermore, we derive the regularizer is a norm given by a semidefinite program (SDP). (b) In contrast, for multi-channel inputs, multiple output channels can be necessary to merely realize all matrix-valued linear functions and thus the inductive bias does depend on $C$. However, for sufficiently large $C$, the induced regularizer is again given by an SDP that is independent of $C$. In particular, the induced regularizer for $K=1$ and $K=D$ (input dimension) are given in closed form as the nuclear norm and the $\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients of the linear predictor. We investigate the broader applicability of our theoretical results to implicit regularization from gradient descent on linear and ReLU networks through experiments on MNIST and CIFAR-10 datasets.  ( 2 min )
    Flat latent manifolds for music improvisation between human and machine. (arXiv:2202.12243v1 [cs.SD])
    The use of machine learning in artistic music generation leads to controversial discussions of the quality of art, for which objective quantification is nonsensical. We therefore consider a music-generating algorithm as a counterpart to a human musician, in a setting where reciprocal improvisation is to lead to new experiences, both for the musician and the audience. To obtain this behaviour, we resort to the framework of recurrent Variational Auto-Encoders (VAE) and learn to generate music, seeded by a human musician. In the learned model, we generate novel musical sequences by interpolation in latent space. Standard VAEs however do not guarantee any form of smoothness in their latent representation. This translates into abrupt changes in the generated music sequences. To overcome these limitations, we regularise the decoder and endow the latent space with a flat Riemannian manifold, i.e., a manifold that is isometric to the Euclidean space. As a result, linearly interpolating in the latent space yields realistic and smooth musical changes that fit the type of machine--musician interactions we aim for. We provide empirical evidence for our method via a set of experiments on music datasets and we deploy our model for an interactive jam session with a professional drummer. The live performance provides qualitative evidence that the latent representation can be intuitively interpreted and exploited by the drummer to drive the interplay. Beyond the musical application, our approach showcases an instance of human-centred design of machine-learning models, driven by interpretability and the interaction with the end user.  ( 2 min )
    Sample Efficiency of Data Augmentation Consistency Regularization. (arXiv:2202.12230v1 [cs.LG])
    Data augmentation is popular in the training of large neural networks; currently, however, there is no clear theoretical comparison between different algorithmic choices on how to use augmented data. In this paper, we take a step in this direction - we first present a simple and novel analysis for linear regression, demonstrating that data augmentation consistency (DAC) is intrinsically more efficient than empirical risk minimization on augmented data (DA-ERM). We then propose a new theoretical framework for analyzing DAC, which reframes DAC as a way to reduce function class complexity. The new framework characterizes the sample efficiency of DAC for various non-linear models (e.g., neural networks). Further, we perform experiments that make a clean and apples-to-apples comparison (i.e., with no extra modeling or data tweaks) between ERM and consistency regularization using CIFAR-100 and WideResNet; these together demonstrate the superior efficacy of DAC.  ( 2 min )
    Overcoming a Theoretical Limitation of Self-Attention. (arXiv:2202.12172v1 [cs.LG])
    Although transformers are remarkably effective for many tasks, there are some surprisingly easy-looking regular languages that they struggle with. Hahn shows that for languages where acceptance depends on a single input symbol, a transformer's classification decisions become less and less confident (that is, with cross-entropy approaching 1 bit per string) as input strings get longer and longer. We examine this limitation using two languages: PARITY, the language of bit strings with an odd number of 1s, and FIRST, the language of bit strings starting with a 1. We demonstrate three ways of overcoming the limitation suggested by Hahn's lemma. First, we settle an open question by constructing a transformer that recognizes PARITY with perfect accuracy, and similarly for FIRST. Second, we use layer normalization to bring the cross-entropy of both models arbitrarily close to zero. Third, when transformers need to focus on a single position, as for FIRST, we find that they can fail to generalize to longer strings; we offer a simple remedy to this problem that also improves length generalization in machine translation.  ( 2 min )
    Sequential Asset Ranking within Nonstationary Time Series. (arXiv:2202.12186v1 [cs.CE])
    Financial time series are both autocorrelated and nonstationary, presenting modelling challenges that violate the independent and identically distributed random variables assumption of most regression and classification models. The prediction with expert advice framework makes no assumptions on the data-generating mechanism yet generates predictions that work well for all sequences, with performance nearly as good as the best expert with hindsight. We conduct research using S&P 250 daily sampled data, extending the academic research into cross-sectional momentum trading strategies. We introduce a novel ranking algorithm from the prediction with expert advice framework, the naive Bayes asset ranker, to select subsets of assets to hold in either long-only or long/short portfolios. Our algorithm generates the best total returns and risk-adjusted returns, net of transaction costs, outperforming the long-only holding of the S&P 250 with hindsight. Furthermore, our ranking algorithm outperforms a proxy for the regress-then-rank cross-sectional momentum trader, a sequentially fitted curds and whey multivariate regression procedure.  ( 2 min )
    Clustering Edges in Directed Graphs. (arXiv:2202.12265v1 [cs.SI])
    How do vertices exert influence in graph data? We develop a framework for edge clustering, a new method for exploratory data analysis that reveals how both vertices and edges collaboratively accomplish directed influence in graphs, especially for directed graphs. In contrast to the ubiquitous vertex clustering which groups vertices, edge clustering groups edges. Edges sharing a functional affinity are assigned to the same group and form an influence subgraph cluster. With a complexity comparable to that of vertex clustering, this framework presents three different methods for edge spectral clustering that reveal important influence subgraphs in graph data, with each method providing different insight into directed influence processes. We present several diverse examples demonstrating the potential for widespread application of edge clustering in scientific research.  ( 2 min )
    Clarifying MCMC-based training of modern EBMs : Contrastive Divergence versus Maximum Likelihood. (arXiv:2202.12176v1 [cs.LG])
    The Energy-Based Model (EBM) framework is a very general approach to generative modeling that tries to learn and exploit probability distributions only defined though unnormalized scores. It has risen in popularity recently thanks to the impressive results obtained in image generation by parameterizing the distribution with Convolutional Neural Networks (CNN). However, the motivation and theoretical foundations behind modern EBMs are often absent from recent papers and this sometimes results in some confusion. In particular, the theoretical justifications behind the popular MCMC-based learning algorithm Contrastive Divergence (CD) are often glossed over and we find that this leads to theoretical errors in recent influential papers (Du & Mordatch, 2019; Du et al., 2020). After offering a first-principles introduction of MCMC-based training, we argue that the learning algorithm they use can in fact not be described as CD and reinterpret theirs methods in light of a new interpretation. Finally, we discuss the implications of our new interpretation and provide some illustrative experiments.  ( 2 min )
    Collaborative Training of Heterogeneous Reinforcement Learning Agents in Environments with Sparse Rewards: What and When to Share?. (arXiv:2202.12174v1 [cs.LG])
    In the early stages of human life, babies develop their skills by exploring different scenarios motivated by their inherent satisfaction rather than by extrinsic rewards from the environment. This behavior, referred to as intrinsic motivation, has emerged as one solution to address the exploration challenge derived from reinforcement learning environments with sparse rewards. Diverse exploration approaches have been proposed to accelerate the learning process over single- and multi-agent problems with homogeneous agents. However, scarce studies have elaborated on collaborative learning frameworks between heterogeneous agents deployed into the same environment, but interacting with different instances of the latter without any prior knowledge. Beyond the heterogeneity, each agent's characteristics grant access only to a subset of the full state space, which may hide different exploration strategies and optimal solutions. In this work we combine ideas from intrinsic motivation and transfer learning. Specifically, we focus on sharing parameters in actor-critic model architectures and on combining information obtained through intrinsic motivation with the aim of having a more efficient exploration and faster learning. We test our strategies through experiments performed over a modified ViZDooM's My Way Home scenario, which is more challenging than its original version and allows evaluating the heterogeneity between agents. Our results reveal different ways in which a collaborative framework with little additional computational cost can outperform an independent learning process without knowledge sharing. Additionally, we depict the need for modulating correctly the importance between the extrinsic and intrinsic rewards to avoid undesired agent behaviors.  ( 2 min )
    Machine Learning for Intrusion Detection in Industrial Control Systems: Applications, Challenges, and Recommendations. (arXiv:2202.11917v1 [cs.CR])
    Methods from machine learning are being applied to design Industrial Control Systems resilient to cyber-attacks. Such methods focus on two major areas: the detection of intrusions at the network-level using the information acquired through network packets, and detection of anomalies at the physical process level using data that represents the physical behavior of the system. This survey focuses on four types of methods from machine learning in use for intrusion and anomaly detection, namely, supervised, semi-supervised, unsupervised, and reinforcement learning. Literature available in the public domain was carefully selected, analyzed, and placed in a 7-dimensional space for ease of comparison. The survey is targeted at researchers, students, and practitioners. Challenges associated in using the methods and research gaps are identified and recommendations are made to fill the gaps.  ( 2 min )
    Self-Training: A Survey. (arXiv:2202.12040v1 [cs.LG])
    In recent years, semi-supervised algorithms have received a lot of interest in both academia and industry. Among the existing techniques, self-training methods have arguably received more attention in the last few years. These models are designed to search the decision boundary on low density regions without making extra assumptions on the data distribution, and use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to the set of unlabeled training samples with a margin greater than a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data and train a new classifier in conjunction with the labeled training set. We present self-training methods for binary and multiclass classification and their variants which were recently developed using Neural Networks. Finally, we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.  ( 2 min )
    Embedded Ensembles: Infinite Width Limit and Operating Regimes. (arXiv:2202.12297v1 [stat.ML])
    A memory efficient approach to ensembling neural networks is to share most weights among the ensembled models by means of a single reference network. We refer to this strategy as Embedded Ensembling (EE); its particular examples are BatchEnsembles and Monte-Carlo dropout ensembles. In this paper we perform a systematic theoretical and empirical analysis of embedded ensembles with different number of models. Theoretically, we use a Neural-Tangent-Kernel-based approach to derive the wide network limit of the gradient descent dynamics. In this limit, we identify two ensemble regimes - independent and collective - depending on the architecture and initialization strategy of ensemble models. We prove that in the independent regime the embedded ensemble behaves as an ensemble of independent models. We confirm our theoretical prediction with a wide range of experiments with finite networks, and further study empirically various effects such as transition between the two regimes, scaling of ensemble performance with the network width and number of models, and dependence of performance on a number of architecture and hyperparameter choices.  ( 2 min )
    Bounding Membership Inference. (arXiv:2202.12232v1 [cs.LG])
    Differential Privacy (DP) is the de facto standard for reasoning about the privacy guarantees of a training algorithm. Despite the empirical observation that DP reduces the vulnerability of models to existing membership inference (MI) attacks, a theoretical underpinning as to why this is the case is largely missing in the literature. In practice, this means that models need to be trained with DP guarantees that greatly decrease their accuracy. In this paper, we provide a tighter bound on the accuracy of any MI adversary when a training algorithm provides $\epsilon$-DP. Our bound informs the design of a novel privacy amplification scheme, where an effective training set is sub-sampled from a larger set prior to the beginning of training, to greatly reduce the bound on MI accuracy. As a result, our scheme enables $\epsilon$-DP users to employ looser DP guarantees when training their model to limit the success of any MI adversary; this ensures that the model's accuracy is less impacted by the privacy guarantee. Finally, we discuss implications of our MI bound on the field of machine unlearning.  ( 2 min )
    Activation Functions: Dive into an optimal activation function. (arXiv:2202.12065v1 [cs.LG])
    Activation functions have come up as one of the essential components of neural networks. The choice of adequate activation function can impact the accuracy of these methods. In this study, we experiment for finding an optimal activation function by defining it as a weighted sum of existing activation functions and then further optimizing these weights while training the network. The study uses three activation functions, ReLU, tanh, and sin, over three popular image datasets, MNIST, FashionMNIST, and KMNIST. We observe that the ReLU activation function can easily overlook other activation functions. Also, we see that initial layers prefer to have ReLU or LeakyReLU type of activation functions, but deeper layers tend to prefer more convergent activation functions.  ( 2 min )
    Optimal Learning Rates of Deep Convolutional Neural Networks: Additive Ridge Functions. (arXiv:2202.12119v1 [cs.LG])
    Convolutional neural networks have shown extraordinary abilities in many applications, especially those related to the classification tasks. However, for the regression problem, the abilities of convolutional structures have not been fully understood, and further investigation is needed. In this paper, we consider the mean squared error analysis for deep convolutional neural networks. We show that, for additive ridge functions, convolutional neural networks followed by one fully connected layer with ReLU activation functions can reach optimal mini-max rates (up to a log factor). The convergence rates are dimension independent. This work shows the statistical optimality of convolutional neural networks and may shed light on why convolutional neural networks are able to behave well for high dimensional input.  ( 2 min )
    SQuadMDS: a lean Stochastic Quartet MDS improving global structure preservation in neighbor embedding like t-SNE and UMAP. (arXiv:2202.12087v1 [cs.LG])
    Multidimensional scaling is a statistical process that aims to embed high dimensional data into a lower-dimensional space; this process is often used for the purpose of data visualisation. Common multidimensional scaling algorithms tend to have high computational complexities, making them inapplicable on large data sets. This work introduces a stochastic, force directed approach to multidimensional scaling with a time and space complexity of O(N), with N data points. The method can be combined with force directed layouts of the family of neighbour embedding such as t-SNE, to produce embeddings that preserve both the global and the local structures of the data. Experiments assess the quality of the embeddings produced by the standalone version and its hybrid extension both quantitatively and qualitatively, showing competitive results outperforming state-of-the-art approaches. Codes are available at https://github.com/PierreLambert3/SQuaD-MDS-and-FItSNE-hybrid.  ( 2 min )
    Debugging Differential Privacy: A Case Study for Privacy Auditing. (arXiv:2202.12219v1 [cs.LG])
    Differential Privacy can provide provable privacy guarantees for training data in machine learning. However, the presence of proofs does not preclude the presence of errors. Inspired by recent advances in auditing which have been used for estimating lower bounds on differentially private algorithms, here we show that auditing can also be used to find flaws in (purportedly) differentially private schemes. In this case study, we audit a recent open source implementation of a differentially private deep learning algorithm and find, with 99.99999999% confidence, that the implementation does not satisfy the claimed differential privacy guarantee.  ( 2 min )
    EMOTHAW: A novel database for emotional state recognition from handwriting. (arXiv:2202.12245v1 [cs.CV])
    The detection of negative emotions through daily activities such as handwriting is useful for promoting well-being. The spread of human-machine interfaces such as tablets makes the collection of handwriting samples easier. In this context, we present a first publicly available handwriting database which relates emotional states to handwriting, that we call EMOTHAW. This database includes samples of 129 participants whose emotional states, namely anxiety, depression and stress, are assessed by the Depression Anxiety Stress Scales (DASS) questionnaire. Seven tasks are recorded through a digitizing tablet: pentagons and house drawing, words copied in handprint, circles and clock drawing, and one sentence copied in cursive writing. Records consist in pen positions, on-paper and in-air, time stamp, pressure, pen azimuth and altitude. We report our analysis on this database. From collected data, we first compute measurements related to timing and ductus. We compute separate measurements according to the position of the writing device: on paper or in-air. We analyse and classify this set of measurements (referred to as features) using a random forest approach. This latter is a machine learning method [2], based on an ensemble of decision trees, which includes a feature ranking process. We use this ranking process to identify the features which best reveal a targeted emotional state. We then build random forest classifiers associated to each emotional state. Our results, obtained from cross-validation experiments, show that the targeted emotional states can be identified with accuracies ranging from 60% to 71%.  ( 2 min )
    BERTVision -- A Parameter-Efficient Approach for Question Answering. (arXiv:2202.12210v1 [cs.CL])
    We present a highly parameter efficient approach for Question Answering that significantly reduces the need for extended BERT fine-tuning. Our method uses information from the hidden state activations of each BERT transformer layer, which is discarded during typical BERT inference. Our best model achieves maximal BERT performance at a fraction of the training time and GPU or TPU expense. Performance is further improved by ensembling our model with BERTs predictions. Furthermore, we find that near optimal performance can be achieved for QA span annotation using less training data. Our experiments show that this approach works well not only for span annotation, but also for classification, suggesting that it may be extensible to a wider range of tasks.  ( 2 min )
    A comparative study of in-air trajectories at short and long distances in online handwriting. (arXiv:2202.12237v1 [cs.CV])
    Introduction Existing literature about online handwriting analysis to support pathology diagnosis has taken advantage of in-air trajectories. A similar situation occurred in biometric security applications where the goal is to identify or verify an individual using his signature or handwriting. These studies do not consider the distance of the pen tip to the writing surface. This is due to the fact that current acquisition devices do not provide height formation. However, it is quite straightforward to differentiate movements at two different heights: a) short distance: height lower or equal to 1 cm above a surface of digitizer, the digitizer provides x and y coordinates. b) long distance: height exceeding 1 cm, the only information available is a time stamp that indicates the time that a specific stroke has spent at long distance. Although short distance has been used in several papers, long distances have been ignored and will be investigated in this paper. Methods In this paper, we will analyze a large set of databases (BIOSECURID, EMOTHAW, PaHaW, Oxygen-Therapy and SALT), which contain a total amount of 663 users and 17951 files. We have specifically studied: a) the percentage of time spent on-surface, in-air at short distance, and in-air at long distance for different user profiles (pathological and healthy users) and different tasks; b) The potential use of these signals to improve classification rates. Results and conclusions Our experimental results reveal that long-distance movements represent a very small portion of the total execution time (0.5 % in the case of signatures and 10.4% for uppercase words of BIOSECUR-ID, which is the largest database). In addition, significant differences have been found in the comparison of pathological versus control group for letter l in PaHaW database (p=0.0157) and crossed pentagons in SALT database (p=0.0122)  ( 3 min )
    Heterogeneous Graph based Deep Learning for Biomedical Network Link Prediction. (arXiv:2102.01649v4 [cs.SI] UPDATED)
    Multi-scale biomedical knowledge networks are expanding with emerging experimental technologies that generates multi-scale biomedical big data. Link prediction is increasingly used especially in bipartite biomedical networks to identify hidden biological interactions and relationshipts between key entities such as compounds, targets, gene and diseases. We propose a Graph Neural Networks (GNN) method, namely Graph Pair based Link Prediction model (GPLP), for predicting biomedical network links simply based on their topological interaction information. In GPLP, 1-hop subgraphs extracted from known network interaction matrix is learnt to predict missing links. To evaluate our method, three heterogeneous biomedical networks were used, i.e. Drug-Target Interaction network (DTI), Compound-Protein Interaction network (CPI) from NIH Tox21, and Compound-Virus Inhibition network (CVI). Our proposed GPLP method significantly outperforms over the state-of-the-art baselines. In addition, different network incompleteness is analysed with our devised protocol, and we also design an effective approach to improve the model robustness towards incomplete networks. Our method demonstrates the potential applications in other biomedical networks.  ( 2 min )
    On-line signature verification system with failure to enroll managing. (arXiv:2202.12242v1 [cs.CV])
    In this paper we simulate a real biometric verification system based on on-line signatures. For this purpose we have split the MCYT signature database in three subsets: one for classifier training, another for system adjustment and a third one for system testing simulating enrollment and verification. This context corresponds to a real operation, where a new user tries to enroll an existing system and must be automatically guided by the system in order to detect the failure to enroll situations. The main contribution of this work is the management of failure to enroll situations by means of a new proposal, called intelligent enrollment, which consists of consistency checking in order to automatically reject low quality samples. This strategy lets to enhance the verification errors up to 22% when leaving out 8% of the users. In this situation 8% of the people cannot be enrolled in the system and must be verified by other biometrics or by human abilities. These people are identified with intelligent enrollment and the situation can be thus managed. In addition we also propose a DCT-based feature extractor with threshold coding and discriminability criteria.  ( 2 min )
    Evolutionary Multi-Objective Reinforcement Learning Based Trajectory Control and Task Offloading in UAV-Assisted Mobile Edge Computing. (arXiv:2202.12028v1 [eess.SP])
    This paper studies the trajectory control and task offloading (TCTO) problem in an unmanned aerial vehicle (UAV)-assisted mobile edge computing system, where a UAV flies along a planned trajectory to collect computation tasks from smart devices (SDs). We consider a scenario that SDs are not directly connected by the base station (BS) and the UAV has two roles to play: MEC server or wireless relay. The UAV makes task offloading decisions online, in which the collected tasks can be executed locally on the UAV or offloaded to the BS for remote processing. The TCTO problem involves multi-objective optimization as its objectives are to minimize the task delay and the UAV's energy consumption, and maximize the number of tasks collected by the UAV, simultaneously. This problem is challenging because the three objectives conflict with each other. The existing reinforcement learning (RL) algorithms, either single-objective RLs or single-policy multi-objective RLs, cannot well address the problem since they cannot output multiple policies for various preferences (i.e. weights) across objectives in a single run. This paper adapts the evolutionary multi-objective RL (EMORL), a multi-policy multi-objective RL, to the TCTO problem. This algorithm can output multiple optimal policies in just one run, each optimizing a certain preference. The simulation results demonstrate that the proposed algorithm can obtain more excellent nondominated policies by striking a balance between the three objectives regarding policy quality, compared with two evolutionary and two multi-policy RL algorithms.  ( 2 min )
    Testing Deep Learning Models: A First Comparative Study of Multiple Testing Techniques. (arXiv:2202.12139v1 [cs.SE])
    Deep Learning (DL) has revolutionized the capabilities of vision-based systems (VBS) in critical applications such as autonomous driving, robotic surgery, critical infrastructure surveillance, air and maritime traffic control, etc. By analyzing images, voice, videos, or any type of complex signals, DL has considerably increased the situation awareness of these systems. At the same time, while relying more and more on trained DL models, the reliability and robustness of VBS have been challenged and it has become crucial to test thoroughly these models to assess their capabilities and potential errors. To discover faults in DL models, existing software testing methods have been adapted and refined accordingly. In this article, we provide an overview of these software testing methods, namely differential, metamorphic, mutation, and combinatorial testing, as well as adversarial perturbation testing and review some challenges in their deployment for boosting perception systems used in VBS. We also provide a first experimental comparative study on a classical benchmark used in VBS and discuss its results.  ( 2 min )
    Attention Enables Zero Approximation Error. (arXiv:2202.12166v1 [cs.LG])
    Deep learning models have been widely applied in various aspects of daily life. Many variant models based on deep learning structures have achieved even better performances. Attention-based architectures have become almost ubiquitous in deep learning structures. Especially, the transformer model has now defeated the convolutional neural network in image classification tasks to become the most widely used tool. However, the theoretical properties of attention-based models are seldom considered. In this work, we show that with suitable adaptations, the single-head self-attention transformer with a fixed number of transformer encoder blocks and free parameters is able to generate any desired polynomial of the input with no error. The number of transformer encoder blocks is the same as the degree of the target polynomial. Even more exciting, we find that these transformer encoder blocks in this model do not need to be trained. As a direct consequence, we show that the single-head self-attention transformer with increasing numbers of free parameters is universal. These surprising theoretical results clearly explain the outstanding performances of the transformer model and may shed light on future modifications in real applications. We also provide some experiments to verify our theoretical result.  ( 2 min )
    A Transformer-based Network for Deformable Medical Image Registration. (arXiv:2202.12104v1 [eess.IV])
    Deformable medical image registration plays an important role in clinical diagnosis and treatment. Recently, the deep learning (DL) based image registration methods have been widely investigated and showed excellent performance in computational speed. However, these methods cannot provide enough registration accuracy because of insufficient ability in representing both the global and local features of the moving and fixed images. To address this issue, this paper has proposed the transformer based image registration method. This method uses the distinctive transformer to extract the global and local image features for generating the deformation fields, based on which the registered image is produced in an unsupervised way. Our method can improve the registration accuracy effectively by means of self-attention mechanism and bi-level information flow. Experimental results on such brain MR image datasets as LPBA40 and OASIS-1 demonstrate that compared with several traditional and DL based registration methods, our method provides higher registration accuracy in terms of dice values.  ( 2 min )
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v1 [cs.LG])
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the number of total items. To improve the effectiveness for deep learning, we further propose practical strategies by using initial warm-up and stop gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee.  ( 2 min )
    Towards Effective and Robust Neural Trojan Defenses via Input Filtering. (arXiv:2202.12154v1 [cs.CR])
    Trojan attacks on deep neural networks are both dangerous and surreptitious. Over the past few years, Trojan attacks have advanced from using only a simple trigger and targeting only one class to using many sophisticated triggers and targeting multiple classes. However, Trojan defenses have not caught up with this development. Most defense methods still make out-of-date assumptions about Trojan triggers and target classes, thus, can be easily circumvented by modern Trojan attacks. In this paper, we advocate general defenses that are effective and robust against various Trojan attacks and propose two novel "filtering" defenses with these characteristics called Variational Input Filtering (VIF) and Adversarial Input Filtering (AIF). VIF and AIF leverage variational inference and adversarial training respectively to purify all potential Trojan triggers in the input at run time without making any assumption about their numbers and forms. We further extend "filtering" to "filtering-then-contrasting" - a new defense mechanism that helps avoid the drop in classification accuracy on clean data caused by filtering. Extensive experimental results show that our proposed defenses significantly outperform 4 well-known defenses in mitigating 5 different Trojan attacks including the two state-of-the-art which defeat many strong defenses.  ( 2 min )
    Learning Stochastic Dynamics with Statistics-Informed Neural Network. (arXiv:2202.12278v1 [cs.LG])
    We introduce a machine-learning framework named statistics-informed neural network (SINN) for learning stochastic dynamics from data. This new architecture was theoretically inspired by a universal approximation theorem for stochastic systems introduced in this paper and the projection-operator formalism for stochastic modeling. We devise mechanisms for training the neural network model to reproduce the correct \emph{statistical} behavior of a target stochastic process. Numerical simulation results demonstrate that a well-trained SINN can reliably approximate both Markovian and non-Markovian stochastic dynamics. We demonstrate the applicability of SINN to model transition dynamics. Furthermore, we show that the obtained reduced-order model can be trained on temporally coarse-grained data and hence is well suited for rare-event simulations.  ( 2 min )
    Tighter Expected Generalization Error Bounds via Convexity of Information Measures. (arXiv:2202.12150v1 [cs.IT])
    Generalization error bounds are essential to understanding machine learning algorithms. This paper presents novel expected generalization error upper bounds based on the average joint distribution between the output hypothesis and each input training sample. Multiple generalization error upper bounds based on different information measures are provided, including Wasserstein distance, total variation distance, KL divergence, and Jensen-Shannon divergence. Due to the convexity of the information measures, the proposed bounds in terms of Wasserstein distance and total variation distance are shown to be tighter than their counterparts based on individual samples in the literature. An example is provided to demonstrate the tightness of the proposed generalization error bounds.  ( 2 min )
    Investigating the Use of One-Class Support Vector Machine for Software Defect Prediction. (arXiv:2202.12074v1 [cs.SE])
    Early software defect identification is considered an important step towards software quality assurance. Software defect prediction aims at identifying software components that are likely to cause faults before a software is made available to the end-user. To date, this task has been modeled as a two-class classification problem, however its nature also allows it to be formulated as a one-class classification task. Preliminary results obtained in prior work show that One-Class Support Vector Machine (OCSVM) can outperform two-class classifiers for defect prediction. If confirmed, these results would overcome the data imbalance problem researchers have for long attempted to tackle in this field. In this paper, we further investigate whether learning from one class only is sufficient to produce effective defect prediction models by conducting a thorough large-scale empirical study investigating 15 real-world software projects, three validation scenarios, eight classifiers, robust evaluation measures and statistical significance tests. The results reveal that OCSVM is more suitable for cross-version and cross-project, rather than for within-project defect prediction, thus suggesting it performs better with heterogeneous data. While, we cannot conclude that OCSVM is the best classifier (Random Forest performs best herein), our results show interesting findings that open up further research avenues for training accurate defect prediction classifiers when defective instances are scarce or unavailable.  ( 2 min )
    Temporal Convolution Domain Adaptation Learning for Crops Growth Prediction. (arXiv:2202.12120v1 [cs.LG])
    Existing Deep Neural Nets on crops growth prediction mostly rely on availability of a large amount of data. In practice, it is difficult to collect enough high-quality data to utilize the full potential of these deep learning models. In this paper, we construct an innovative network architecture based on domain adaptation learning to predict crops growth curves with limited available crop data. This network architecture overcomes the challenge of data availability by incorporating generated data from the developed crops simulation model. We are the first to use the temporal convolution filters as the backbone to construct a domain adaptation network architecture which is suitable for deep learning regression models with very limited training data of the target domain. We conduct experiments to test the performance of the network and compare our proposed architecture with other state-of-the-art methods, including a recent LSTM-based domain adaptation network architecture. The results show that the proposed temporal convolution-based network architecture outperforms all benchmarks not only in accuracy but also in model size and convergence rate.  ( 2 min )
    Is Neuro-Symbolic AI Meeting its Promise in Natural Language Processing? A Structured Review. (arXiv:2202.12205v1 [cs.AI])
    Advocates for Neuro-Symbolic AI (NeSy) assert that combining deep learning with symbolic reasoning will lead to stronger AI than either paradigm on its own. As successful as deep learning has been, it is generally accepted that even our best deep learning systems are not very good at abstract reasoning. And since reasoning is inextricably linked to language, it makes intuitive sense that Natural Language Processing (NLP), would be a particularly well-suited candidate for NeSy. We conduct a structured review of studies implementing NeSy for NLP, challenges and future directions, and aim to answer the question of whether NeSy is indeed meeting its promises: reasoning, out-of-distribution generalization, interpretability, learning and reasoning from small data, and transferability to new domains. We examine the impact of knowledge representation, such as rules and semantic networks, language structure and relational structure, and whether implicit or explicit reasoning contributes to higher promise scores. We find that knowledge encoded in relational structures and explicit reasoning tend to lead to more NeSy goals being satisfied. We also advocate for a more methodical approach to the application of theories of reasoning, which we hope can reduce some of the friction between the symbolic and sub-symbolic schools of AI.  ( 2 min )
    An optimal scheduled learning rate for a randomized Kaczmarz algorithm. (arXiv:2202.12224v1 [math.NA])
    We study how the learning rate affects the performance of a relaxed randomized Kaczmarz algorithm for solving $A x \approx b + \varepsilon$, where $A x =b$ is a consistent linear system and $\varepsilon$ has independent mean zero random entries. We derive a scheduled learning rate which optimizes a bound on the expected error that is sharp in certain cases; in contrast to the exponential convergence of the standard randomized Kaczmarz algorithm, our optimized bound involves the reciprocal of the Lambert-$W$ function of an exponential.  ( 2 min )
    Interfering Paths in Decision Trees: A Note on Deodata Predictors. (arXiv:2202.12064v1 [cs.LG])
    A technique for improving the prediction accuracy of decision trees is proposed. It consists in evaluating the tree's branches in parallel over multiple paths. The technique enables predictions that are more aligned with the ones generated by the nearest neighborhood variant of the deodata algorithms. The technique also enables the hybridization of the decision tree algorithm with the nearest neighborhood variant.  ( 2 min )
    Quantum Deep Reinforcement Learning for Robot Navigation Tasks. (arXiv:2202.12180v1 [cs.RO])
    In this work, we utilize Quantum Deep Reinforcement Learning as method to learn navigation tasks for a simple, wheeled robot in three simulated environments of increasing complexity. We show similar performance of a parameterized quantum circuit trained with well established deep reinforcement learning techniques in a hybrid quantum-classical setup compared to a classical baseline. To our knowledge this is the first demonstration of quantum machine learning (QML) for robotic behaviors. Thus, we establish robotics as a viable field of study for QML algorithms and henceforth quantum computing and quantum machine learning as potential techniques for future advancements in autonomous robotics. Beyond that, we discuss current limitations of the presented approach as well as future research directions in the field of quantum machine learning for autonomous robots.  ( 2 min )
    Closing the Gap between Single-User and Multi-User VoiceFilter-Lite. (arXiv:2202.12169v1 [eess.AS])
    VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.  ( 2 min )
    Dynamic Defense Against Byzantine Poisoning Attacks in Federated Learning. (arXiv:2007.15030v2 [cs.LG] UPDATED)
    Federated learning, as a distributed learning that conducts the training on the local devices without accessing to the training data, is vulnerable to Byzatine poisoning adversarial attacks. We argue that the federated learning model has to avoid those kind of adversarial attacks through filtering out the adversarial clients by means of the federated aggregation operator. We propose a dynamic federated aggregation operator that dynamically discards those adversarial clients and allows to prevent the corruption of the global learning model. We assess it as a defense against adversarial attacks deploying a deep learning classification model in a federated learning setting on the Fed-EMNIST Digits, Fashion MNIST and CIFAR-10 image datasets. The results show that the dynamic selection of the clients to aggregate enhances the performance of the global learning model and discards the adversarial and poor (with low quality models) clients.  ( 2 min )
    Attentive Temporal Pooling for Conformer-based Streaming Language Identification in Long-form Speech. (arXiv:2202.12163v1 [eess.AS])
    In this paper, we introduce a novel language identification system based on conformer layers. We propose an attentive temporal pooling mechanism to allow the model to carry information in long-form audio via a recurrent form, such that the inference can be performed in a streaming fashion. Additionally, a simple domain adaptation mechanism is introduced to allow adapting an existing language identification model to a new domain where the prior language distribution is different. We perform a comparative study of different model topologies under different constraints of model size, and find that conformer-base models outperform LSTM and transformer based models. Our experiments also show that attentive temporal pooling and domain adaptation significantly improve the model accuracy.  ( 2 min )
    Exploring the Unfairness of DP-SGD Across Settings. (arXiv:2202.12058v1 [cs.LG])
    End users and regulators require private and fair artificial intelligence models, but previous work suggests these objectives may be at odds. We use the CivilComments to evaluate the impact of applying the {\em de facto} standard approach to privacy, DP-SGD, across several fairness metrics. We evaluate three implementations of DP-SGD: for dimensionality reduction (PCA), linear classification (logistic regression), and robust deep learning (Group-DRO). We establish a negative, logarithmic correlation between privacy and fairness in the case of linear classification and robust deep learning. DP-SGD had no significant impact on fairness for PCA, but upon inspection, also did not seem to lead to private representations.  ( 2 min )
    All You Need Is Supervised Learning: From Imitation Learning to Meta-RL With Upside Down RL. (arXiv:2202.11960v1 [cs.LG])
    Upside down reinforcement learning (UDRL) flips the conventional use of the return in the objective function in RL upside down, by taking returns as input and predicting actions. UDRL is based purely on supervised learning, and bypasses some prominent issues in RL: bootstrapping, off-policy corrections, and discount factors. While previous work with UDRL demonstrated it in a traditional online RL setting, here we show that this single algorithm can also work in the imitation learning and offline RL settings, be extended to the goal-conditioned RL setting, and even the meta-RL setting. With a general agent architecture, a single UDRL agent can learn across all paradigms.  ( 2 min )
    Explanatory Paradigms in Neural Networks. (arXiv:2202.11838v1 [cs.LG])
    In this article, we present a leap-forward expansion to the study of explainability in neural networks by considering explanations as answers to abstract reasoning-based questions. With $P$ as the prediction from a neural network, these questions are `Why P?', `What if not P?', and `Why P, rather than Q?' for a given contrast prediction $Q$. The answers to these questions are observed correlations, observed counterfactuals, and observed contrastive explanations respectively. Together, these explanations constitute the abductive reasoning scheme. We term the three explanatory schemes as observed explanatory paradigms. The term observed refers to the specific case of post-hoc explainability, when an explanatory technique explains the decision $P$ after a trained neural network has made the decision $P$. The primary advantage of viewing explanations through the lens of abductive reasoning-based questions is that explanations can be used as reasons while making decisions. The post-hoc field of explainability, that previously only justified decisions, becomes active by being involved in the decision making process and providing limited, but relevant and contextual interventions. The contributions of this article are: ($i$) realizing explanations as reasoning paradigms, ($ii$) providing a probabilistic definition of observed explanations and their completeness, ($iii$) creating a taxonomy for evaluation of explanations, and ($iv$) positioning gradient-based complete explanainability's replicability and reproducibility across multiple applications and data modalities, ($v$) code repositories, publicly available at https://github.com/olivesgatech/Explanatory-Paradigms.  ( 2 min )
    "Is not the truth the truth?": Analyzing the Impact of User Validations for Bus In/Out Detection in Smartphone-based Surveys. (arXiv:2202.11961v1 [cs.HC])
    Passenger flow allows the study of users' behavior through the public network and assists in designing new facilities and services. This flow is observed through interactions between passengers and infrastructure. For this task, Bluetooth technology and smartphones represent the ideal solution. The latter component allows users' identification, authentication, and billing, while the former allows short-range implicit interactions, device-to-device. To assess the potential of such a use case, we need to verify how robust Bluetooth signal and related machine learning (ML) classifiers are against the noise of realistic contexts. Therefore, we model binary passenger states with respect to a public vehicle, where one can either be-in or be-out (BIBO). The BIBO label identifies a fundamental building block of continuously-valued passenger flow. This paper describes the Human-Computer interaction experimental setting in a semi-controlled environment, which involves: two autonomous vehicles operating on two routes, serving three bus stops and eighteen users, as well as a proprietary smartphone-Bluetooth sensing platform. The resulting dataset includes multiple sensors' measurements of the same event and two ground-truth levels, the first being validation by participants, the second by three video-cameras surveilling buses and track. We performed a Monte-Carlo simulation of labels-flip to emulate human errors in the labeling process, as is known to happen in smartphone surveys; next we used such flipped labels for supervised training of ML classifiers. The impact of errors on model performance bias can be large. Results show ML tolerance to label flips caused by human or machine errors up to 30%.  ( 2 min )
    A fair pricing model via adversarial learning. (arXiv:2202.12008v1 [stat.ML])
    At the core of insurance business lies classification between risky and non-risky insureds, actuarial fairness meaning that risky insureds should contribute more and pay a higher premium than non-risky or less-risky ones. Actuaries, therefore, use econometric or machine learning techniques to classify, but the distinction between a fair actuarial classification and "discrimination" is subtle. For this reason, there is a growing interest about fairness and discrimination in the actuarial community Lindholm, Richman, Tsanakas, and Wuthrich (2022). Presumably, non-sensitive characteristics can serve as substitutes or proxies for protected attributes. For example, the color and model of a car, combined with the driver's occupation, may lead to an undesirable gender bias in the prediction of car insurance prices. Surprisingly, we will show that debiasing the predictor alone may be insufficient to maintain adequate accuracy (1). Indeed, the traditional pricing model is currently built in a two-stage structure that considers many potentially biased components such as car or geographic risks. We will show that this traditional structure has significant limitations in achieving fairness. For this reason, we have developed a novel pricing model approach. Recently some approaches have Blier-Wong, Cossette, Lamontagne, and Marceau (2021); Wuthrich and Merz (2021) shown the value of autoencoders in pricing. In this paper, we will show that (2) this can be generalized to multiple pricing factors (geographic, car type), (3) it perfectly adapted for a fairness context (since it allows to debias the set of pricing components): We extend this main idea to a general framework in which a single whole pricing model is trained by generating the geographic and car pricing components needed to predict the pure premium while mitigating the unwanted bias according to the desired metric.  ( 2 min )
    Can deep neural networks learn process model structure? An assessment framework and analysis. (arXiv:2202.11985v1 [cs.LG])
    Predictive process monitoring concerns itself with the prediction of ongoing cases in (business) processes. Prediction tasks typically focus on remaining time, outcome, next event or full case suffix prediction. Various methods using machine and deep learning havebeen proposed for these tasks in recent years. Especially recurrent neural networks (RNNs) such as long short-term memory nets (LSTMs) have gained in popularity. However, no research focuses on whether such neural network-based models can truly learn the structure of underlying process models. For instance, can such neural networks effectively learn parallel behaviour or loops? Therefore, in this work, we propose an evaluation scheme complemented with new fitness, precision, and generalisation metrics, specifically tailored towards measuring the capacity of deep learning models to learn process model structure. We apply this framework to several process models with simple control-flow behaviour, on the task of next-event prediction. Our results show that, even for such simplistic models, careful tuning of overfitting countermeasures is required to allow these models to learn process model structure.  ( 2 min )
    Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge. (arXiv:2202.12031v1 [cs.CV])
    Polyps are well-known cancer precursors identified by colonoscopy. However, variability in their size, location, and surface largely affect identification, localisation, and characterisation. Moreover, colonoscopic surveillance and removal of polyps (referred to as polypectomy ) are highly operator-dependent procedures. There exist a high missed detection rate and incomplete removal of colonic polyps due to their variable nature, the difficulties to delineate the abnormality, the high recurrence rates, and the anatomical topography of the colon. There have been several developments in realising automated methods for both detection and segmentation of these polyps using machine learning. However, the major drawback in most of these methods is their ability to generalise to out-of-sample unseen datasets that come from different centres, modalities and acquisition systems. To test this hypothesis rigorously we curated a multi-centre and multi-population dataset acquired from multiple colonoscopy systems and challenged teams comprising machine learning experts to develop robust automated detection and segmentation methods as part of our crowd-sourcing Endoscopic computer vision challenge (EndoCV) 2021. In this paper, we analyse the detection results of the four top (among seven) teams and the segmentation results of the five top teams (among 16). Our analyses demonstrate that the top-ranking teams concentrated on accuracy (i.e., accuracy > 80% on overall Dice score on different validation sets) over real-time performance required for clinical applicability. We further dissect the methods and provide an experiment-based hypothesis that reveals the need for improved generalisability to tackle diversity present in multi-centre datasets.  ( 3 min )
    DC and SA: Robust and Efficient Hyperparameter Optimization of Multi-subnetwork Deep Learning Models. (arXiv:2202.11841v1 [cs.LG])
    We present two novel hyperparameter optimization strategies for optimization of deep learning models with a modular architecture constructed of multiple subnetworks. As complex networks with multiple subnetworks become more frequently applied in machine learning, hyperparameter optimization methods are required to efficiently optimize their hyperparameters. Existing hyperparameter searches are general, and can be used to optimize such networks, however, by exploiting the multi-subnetwork architecture, these searches can be sped up substantially. The proposed methods offer faster convergence to a better-performing final model. To demonstrate this, we propose 2 independent approaches to enhance these prior algorithms: 1) a divide-and-conquer approach, in which the best subnetworks of top-performing models are combined, allowing for more rapid sampling of the hyperparameter search space. 2) A subnetwork adaptive approach that distributes computational resources based on the importance of each subnetwork, allowing more intelligent resource allocation. These approaches can be flexibily applied to many hyperparameter optimization algorithms. To illustrate this, we combine our approaches with the commonly-used Bayesian optimization method. Our approaches are then tested against both synthetic examples and real-world examples and applied to multiple network types including convolutional neural networks and dense feed forward neural networks. Our approaches show an increased optimization efficiency of up to 23.62x, and a final performance boost of up to 3.5% accuracy for classification and 4.4 MSE for regression, when compared to comparable BO approach.  ( 2 min )
    A Rigorous Study of Integrated Gradients Method and Extensions to Internal Neuron Attributions. (arXiv:2202.11912v1 [cs.LG])
    As the efficacy of deep learning (DL) grows, so do concerns about the lack of transparency of these black-box models. Attribution methods aim to improve transparency of DL models by quantifying an input feature's importance to a model's prediction. The method of Integrated gradients (IG) sets itself apart by claiming other methods failed to satisfy desirable axioms, while IG and methods like it uniquely satisfied said axioms. This paper comments on fundamental aspects of IG and its applications/extensions: 1) We identify key unaddressed differences between DL-attribution function spaces and the supporting literature's function spaces which problematize previous claims of IG uniqueness. We show that with the introduction of an additional axiom, $\textit{non-decreasing positivity}$, the uniqueness claim can be established. 2) We address the question of input sensitivity by identifying function spaces where the IG is/is not Lipschitz continuous in the attributed input. 3) We show how axioms for single-baseline methods in IG impart analogous properties for methods where the baseline is a probability distribution over the input sample space. 4) We introduce a means of decomposing the IG map with respect to a layer of internal neurons while simultaneously gaining internal-neuron attributions. Finally, we present experimental results validating the decomposition and internal neuron attributions.  ( 2 min )
    Predicting the impact of treatments over time with uncertainty aware neural differential equations. (arXiv:2202.11987v1 [cs.LG])
    Predicting the impact of treatments from observational data only still represents a majorchallenge despite recent significant advances in time series modeling. Treatment assignments are usually correlated with the predictors of the response, resulting in a lack of data support for counterfactual predictions and therefore in poor quality estimates. Developments in causal inference have lead to methods addressing this confounding by requiring a minimum level of overlap. However,overlap is difficult to assess and usually notsatisfied in practice. In this work, we propose Counterfactual ODE (CF-ODE), a novel method to predict the impact of treatments continuously over time using Neural Ordinary Differential Equations equipped with uncertainty estimates. This allows to specifically assess which treatment outcomes can be reliably predicted. We demonstrate over several longitudinal data sets that CF-ODE provides more accurate predictions and more reliable uncertainty estimates than previously available methods.  ( 2 min )
    On Learning Mixture Models with Sparse Parameters. (arXiv:2202.11940v1 [cs.LG])
    Mixture models are widely used to fit complex and multimodal datasets. In this paper we study mixtures with high dimensional sparse latent parameter vectors and consider the problem of support recovery of those vectors. While parameter learning in mixture models is well-studied, the sparsity constraint remains relatively unexplored. Sparsity of parameter vectors is a natural constraint in variety of settings, and support recovery is a major step towards parameter estimation. We provide efficient algorithms for support recovery that have a logarithmic sample complexity dependence on the dimensionality of the latent space. Our algorithms are quite general, namely they are applicable to 1) mixtures of many different canonical distributions including Uniform, Poisson, Laplace, Gaussians, etc. 2) Mixtures of linear regressions and linear classifiers with Gaussian covariates under different assumptions on the unknown parameters. In most of these settings, our results are the first guarantees on the problem while in the rest, our results provide improvements on existing works.  ( 2 min )
    Drawing Inductor Layout with a Reinforcement Learning Agent: Method and Application for VCO Inductors. (arXiv:2202.11798v1 [cs.AI])
    Design of Voltage-Controlled Oscillator (VCO) inductors is a laborious and time-consuming task that is conventionally done manually by human experts. In this paper, we propose a framework for automating the design of VCO inductors, using Reinforcement Learning (RL). We formulate the problem as a sequential procedure, where wire segments are drawn one after another, until a complete inductor is created. We then employ an RL agent to learn to draw inductors that meet certain target specifications. In light of the need to tweak the target specifications throughout the circuit design cycle, we also develop a variant in which the agent can learn to quickly adapt to draw new inductors for moderately different target specifications. Our empirical results show that the proposed framework is successful at automatically generating VCO inductors that meet or exceed the target specification.  ( 2 min )
    A Fair Empirical Risk Minimization with Generalized Entropy. (arXiv:2202.11966v1 [cs.LG])
    Recently a parametric family of fairness metrics to quantify algorithmic fairness has been proposed based on generalized entropy which have been originally used in economics and public welfare. Since these metrics have several advantages such as quantifying unfairness at the individual-level and group-level, and unfold trade-off between the individual fairness and group-level fairness, algorithmic fairness requirement may be given in terms of generalized entropy for a fair classification problem. We consider a fair empirical risk minimization with a fairness constraint specified by generalized entropy. We theoretically investigate if the fair empirical fair classification problem is learnable and how to find an approximate optimal classifier of it.  ( 2 min )
    Rare Gems: Finding Lottery Tickets at Initialization. (arXiv:2202.12002v1 [cs.LG])
    It has been widely observed that large neural networks can be pruned to a small fraction of their original size, with little loss in accuracy, by typically following a time-consuming "train, prune, re-train" approach. Frankle & Carbin (2018) conjecture that we can avoid this by training lottery tickets, i.e., special sparse subnetworks found at initialization, that can be trained to high accuracy. However, a subsequent line of work presents concrete evidence that current algorithms for finding trainable networks at initialization, fail simple baseline comparisons, e.g., against training random sparse subnetworks. Finding lottery tickets that train to better accuracy compared to simple baselines remains an open problem. In this work, we partially resolve this open problem by discovering rare gems: subnetworks at initialization that attain considerable accuracy, even before training. Refining these rare gems - "by means of fine-tuning" - beats current baselines and leads to accuracy competitive or better than magnitude pruning methods.  ( 2 min )
    AutoCl : A Visual Interactive System for Automatic Deep Learning Classifier Recommendation Based on Models Performance. (arXiv:2202.11928v1 [cs.LG])
    Nowadays, deep learning (DL) models being increasingly applied to various fields, people without technical expertise and domain knowledge struggle to find an appropriate model for their task. In this paper, we introduce AutoCl a visual interactive recommender system aimed at helping non-experts to adopt an appropriate DL classifier. Our system enables users to compare the performance and behavior of multiple classifiers trained with various hyperparameter setups as well as automatically recommends a best classifier with appropriate hyperparameter. We compare features of AutoCl against several recent AutoML systems and show that it helps non-experts better in choosing DL classifier. Finally, we demonstrate use cases for image classification using publicly available dataset to show the capability of our system.  ( 2 min )
    Physics-informed neural networks for inverse problems in supersonic flows. (arXiv:2202.11821v1 [math.NA])
    Accurate solutions to inverse supersonic compressible flow problems are often required for designing specialized aerospace vehicles. In particular, we consider the problem where we have data available for density gradients from Schlieren photography as well as data at the inflow and part of wall boundaries. These inverse problems are notoriously difficult and traditional methods may not be adequate to solve such ill-posed inverse problems. To this end, we employ the physics-informed neural networks (PINNs) and its extended version, extended PINNs (XPINNs), where domain decomposition allows deploying locally powerful neural networks in each subdomain, which can provide additional expressivity in subdomains, where a complex solution is expected. Apart from the governing compressible Euler equations, we also enforce the entropy conditions in order to obtain viscosity solutions. Moreover, we enforce positivity conditions on density and pressure. We consider inverse problems involving two-dimensional expansion waves, two-dimensional oblique and bow shock waves. We compare solutions obtained by PINNs and XPINNs and invoke some theoretical results that can be used to decide on the generalization errors of the two methods.  ( 2 min )
    Learning Multi-Object Dynamics with Compositional Neural Radiance Fields. (arXiv:2202.11855v1 [cs.CV])
    We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks. A central question in learning dynamic models from sensor observations is on which representations predictions should be performed. NeRFs have become a popular choice for representing scenes due to their strong 3D prior. However, most NeRF approaches are trained on a single scene, representing the whole scene with a global model, making generalization to novel scenes, containing different numbers of objects, challenging. Instead, we present a compositional, object-centric auto-encoder framework that maps multiple views of the scene to a \emph{set} of latent vectors representing each object separately. The latent vectors parameterize individual NeRF models from which the scene can be reconstructed and rendered from novel viewpoints. We train a graph neural network dynamics model in the latent space to achieve compositionality for dynamics prediction. A key feature of our approach is that the learned 3D information of the scene through the NeRF model enables us to incorporate structural priors in learning the dynamics models, making long-term predictions more stable. The model can further be used to synthesize new scenes from individual object observations. For planning, we utilize RRTs in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient. In the experiments, we show that the model outperforms several baselines on a pushing task containing many objects. Video: https://dannydriess.github.io/compnerfdyn/  ( 2 min )
    Threading the Needle of On and Off-Manifold Value Functions for Shapley Explanations. (arXiv:2202.11919v1 [cs.LG])
    A popular explainable AI (XAI) approach to quantify feature importance of a given model is via Shapley values. These Shapley values arose in cooperative games, and hence a critical ingredient to compute these in an XAI context is a so-called value function, that computes the "value" of a subset of features, and which connects machine learning models to cooperative games. There are many possible choices for such value functions, which broadly fall into two categories: on-manifold and off-manifold value functions, which take an observational and an interventional viewpoint respectively. Both these classes however have their respective flaws, where on-manifold value functions violate key axiomatic properties and are computationally expensive, while off-manifold value functions pay less heed to the data manifold and evaluate the model on regions for which it wasn't trained. Thus, there is no consensus on which class of value functions to use. In this paper, we show that in addition to these existing issues, both classes of value functions are prone to adversarial manipulations on low density regions. We formalize the desiderata of value functions that respect both the model and the data manifold in a set of axioms and are robust to perturbation on off-manifold regions, and show that there exists a unique value function that satisfies these axioms, which we term the Joint Baseline value function, and the resulting Shapley value the Joint Baseline Shapley (JBshap), and validate the effectiveness of JBshap in experiments.  ( 2 min )
    Validating an SVM-based neonatal seizure detection algorithm for generalizability, non-inferiority and clinical efficacy. (arXiv:2202.12023v1 [cs.LG])
    Neonatal seizure detection algorithms (SDA) are approaching the benchmark of human expert annotation. Measures of algorithm generalizability and non-inferiority as well as measures of clinical efficacy are needed to assess the full scope of neonatal SDA performance. We validated our neonatal SDA on an independent data set of 28 neonates. Generalizability was tested by comparing the performance of the original training set (cross-validation) to its performance on the validation set. Non-inferiority was tested by assessing inter-observer agreement between combinations of SDA and two human expert annotations. Clinical efficacy was tested by comparing how the SDA and human experts quantified seizure burden and identified clinically significant periods of seizure activity in the EEG. Algorithm performance was consistent between training and validation sets with no significant worsening in AUC (p>0.05, n =28). SDA output was inferior to the annotation of the human expert, however, re-training with an increased diversity of data resulted in non-inferior performance ($\Delta\kappa$=0.077, 95% CI: -0.002-0.232, n=18). The SDA assessment of seizure burden had an accuracy ranging from 89-93%, and 87% for identifying periods of clinical interest. The proposed SDA is approaching human equivalence and provides a clinically relevant interpretation of the EEG.  ( 2 min )
    A general framework for adaptive two-index fusion attribute weighted naive Bayes. (arXiv:2202.11963v1 [cs.LG])
    Naive Bayes(NB) is one of the essential algorithms in data mining. However, it is rarely used in reality because of the attribute independent assumption. Researchers have proposed many improved NB methods to alleviate this assumption. Among these methods, due to high efficiency and easy implementation, the filter attribute weighted NB methods receive great attentions. However, there still exists several challenges, such as the poor representation ability for single index and the fusion problem of two indexes. To overcome above challenges, we propose a general framework for Adaptive Two-index Fusion attribute weighted NB(ATFNB). Two types of data description category are used to represent the correlation between classes and attributes, intercorrelation between attributes and attributes, respectively. ATFNB can select any one index from each category. Then, we introduce a switching factor \{beta} to fuse two indexes, which can adaptively adjust the optimal ratio of the two index on various datasets. And a quick algorithm is proposed to infer the optimal interval of switching factor \{beta}. Finally, the weight of each attribute is calculated using the optimal value \{beta} and is integrated into NB classifier to improve the accuracy. The experimental results on 50 benchmark datasets and a Flavia dataset show that ATFNB outperforms the basic NB and state-of-the-art filter weighted NB models. In addition, the ATFNB framework can improve the existing two-index NB model by introducing the adaptive switching factor \{beta}. Auxiliary experimental results demonstrate the improved model significantly increases the accuracy compared to the original model without the adaptive switching factor \{beta}.  ( 2 min )
    XAutoML: A Visual Analytics Tool for Establishing Trust in Automated Machine Learning. (arXiv:2202.11954v1 [cs.LG])
    In the last ten years, various automated machine learning (AutoML) systems have been proposed to build end-to-end machine learning (ML) pipelines with minimal human interaction. Even though such automatically synthesized ML pipelines are able to achieve a competitive performance, recent studies have shown that users do not trust models constructed by AutoML due to missing transparency of AutoML systems and missing explanations for the constructed ML pipelines. In a requirements analysis study with 26 domain experts, data scientists, and AutoML researchers from different professions with vastly different expertise in ML, we collect detailed informational needs to establish trust in AutoML. We propose XAutoML, an interactive visual analytics tool for explaining arbitrary AutoML optimization procedures and ML pipelines constructed by AutoML. XAutoML combines interactive visualizations with established techniques from explainable artificial intelligence (XAI) to make the complete AutoML procedure transparent and explainable. By integrating XAutoML with JupyterLab, experienced users can extend the visual analytics with ad-hoc visualizations based on information extracted from XAutoML. We validate our approach in a user study with the same diverse user group from the requirements analysis. All participants were able to extract useful information from XAutoML, leading to a significantly increased trust in ML pipelines produced by AutoML and the AutoML optimization itself.  ( 2 min )
    Sky Computing: Accelerating Geo-distributed Computing in Federated Learning. (arXiv:2202.11836v1 [cs.LG])
    Federated learning is proposed by Google to safeguard data privacy through training models locally on users' devices. However, with deep learning models growing in size to achieve better results, it becomes increasingly difficult to accommodate the whole model on one single device. Thus, model parallelism is then used to divide the model weights among several devices. With this logic, the approach currently used evenly allocates weights among devices. However, in reality, a computation bottleneck may occur resulting from variant computing power of different users' devices. To address this problem, load balancing is needed to allocate the model weights based on the computational capability of the device. In this paper, we proposed Sky Computing, a load-balanced model parallelism framework to adaptively allocate the weights to devices. Sky Computing outperforms the baseline method by 55% in training time when training 160-layer BERT with 64 nodes. The source code can be found at https://github.com/hpcaitech/SkyComputing.  ( 2 min )
    Large Scale Passenger Detection with Smartphone/Bus Implicit Interaction and Multisensory Unsupervised Cause-effect Learning. (arXiv:2202.11962v1 [cs.LG])
    Intelligent Transportation Systems (ITS) underpin the concept of Mobility as a Service (MaaS), which requires universal and seamless users' access across multiple public and private transportation systems while allowing operators' proportional revenue sharing. Current user sensing technologies such as Walk-in/Walk-out (WIWO) and Check-in/Check-out (CICO) have limited scalability for large-scale deployments. These limitations prevent ITS from supporting analysis, optimization, calculation of revenue sharing, and control of MaaS comfort, safety, and efficiency. We focus on the concept of implicit Be-in/Be-out (BIBO) smartphone-sensing and classification. To close the gap and enhance smartphones towards MaaS, we developed a proprietary smartphone-sensing platform collecting contemporary Bluetooth Low Energy (BLE) signals from BLE devices installed on buses and Global Positioning System (GPS) locations of both buses and smartphones. To enable the training of a model based on GPS features against the BLE pseudo-label, we propose the Cause-Effect Multitask Wasserstein Autoencoder (CEMWA). CEMWA combines and extends several frameworks around Wasserstein autoencoders and neural networks. As a dimensionality reduction tool, CEMWA obtains an auto-validated representation of a latent space describing users' smartphones within the transport system. This representation allows BIBO clustering via DBSCAN. We perform an ablation study of CEMWA's alternative architectures and benchmark against the best available supervised methods. We analyze performance's sensitivity to label quality. Under the na\"ive assumption of accurate ground truth, XGBoost outperforms CEMWA. Although XGBoost and Random Forest prove to be tolerant to label noise, CEMWA is agnostic to label noise by design and provides the best performance with an 88\% F1 score.  ( 2 min )
    Fine-grained TLS Services Classification with Reject Option. (arXiv:2202.11984v1 [cs.LG])
    The recent success and proliferation of machine learning and deep learning have provided powerful tools, which are also utilized for encrypted traffic analysis, classification, and threat detection. These methods, neural networks in particular, are often complex and require a huge corpus of training data. Therefore, this paper focuses on collecting a large up-to-date dataset with almost 200 fine-grained service labels and 140 million network flows extended with packet-level metadata. The number of flows is three orders of magnitude higher than in other existing public labeled datasets of encrypted traffic. The number of service labels, which is important to make the problem hard and realistic, is four times higher than in the public dataset with the most class labels. The published dataset is intended as a benchmark for identifying services in encrypted traffic. Service identification can be further extended with the task of "rejecting" unknown services, i.e., the traffic not seen during the training phase. Neural networks offer superior performance for tackling this more challenging problem. To showcase the dataset's usefulness, we implemented a neural network with a multi-modal architecture, which is the state-of-the-art approach, and achieved 97.04% classification accuracy and detected 91.94% of unknown services with 5% false positive rate.  ( 2 min )
    A Unified Framework for Campaign Performance Forecasting in Online Display Advertising. (arXiv:2202.11877v1 [cs.LG])
    Advertisers usually enjoy the flexibility to choose criteria like target audience, geographic area and bid price when planning an campaign for online display advertising, while they lack forecast information on campaign performance to optimize delivery strategies in advance, resulting in a waste of labour and budget for feedback adjustments. In this paper, we aim to forecast key performance indicators for new campaigns given any certain criteria. Interpretable and accurate results could enable advertisers to manage and optimize their campaign criteria. There are several challenges for this very task. First, platforms usually offer advertisers various criteria when they plan an advertising campaign, it is difficult to estimate campaign performance unifiedly because of the great difference among bidding types. Furthermore, complex strategies applied in bidding system bring great fluctuation on campaign performance, making estimation accuracy an extremely tough problem. To address above challenges, we propose a novel Campaign Performance Forecasting framework, which firstly reproduces campaign performance on historical logs under various bidding types with a unified replay algorithm, in which essential auction processes like match and rank are replayed, ensuring the interpretability on forecast results. Then, we innovatively introduce a multi-task learning method to calibrate the deviation of estimation brought by hard-to-reproduce bidding strategies in replay. The method captures mixture calibration patterns among related forecast indicators to map the estimated results to the true ones, improving both accuracy and efficiency significantly. Experiment results on a dataset from Taobao.com demonstrate that the proposed framework significantly outperforms other baselines by a large margin, and an online A/B test verifies its effectiveness in the real world.  ( 2 min )
    Learning to Merge Tokens in Vision Transformers. (arXiv:2202.12015v1 [cs.CV])
    Transformers are widely applied to solve natural language understanding and computer vision tasks. While scaling up these architectures leads to improved performance, it often comes at the expense of much higher computational costs. In order for large-scale models to remain practical in real-world systems, there is a need for reducing their computational overhead. In this work, we present the PatchMerger, a simple module that reduces the number of patches or tokens the network has to process by merging them between two consecutive intermediate layers. We show that the PatchMerger achieves a significant speedup across various model sizes while matching the original performance both upstream and downstream after fine-tuning.  ( 2 min )
    Counterfactual Explanations for Predictive Business Process Monitoring. (arXiv:2202.12018v1 [cs.AI])
    Predictive business process monitoring increasingly leverages sophisticated prediction models. Although sophisticated models achieve consistently higher prediction accuracy than simple models, one major drawback is their lack of interpretability, which limits their adoption in practice. We thus see growing interest in explainable predictive business process monitoring, which aims to increase the interpretability of prediction models. Existing solutions focus on giving factual explanations.While factual explanations can be helpful, humans typically do not ask why a particular prediction was made, but rather why it was made instead of another prediction, i.e., humans are interested in counterfactual explanations. While research in explainable AI produced several promising techniques to generate counterfactual explanations, directly applying them to predictive process monitoring may deliver unrealistic explanations, because they ignore the underlying process constraints. We propose LORELEY, a counterfactual explanation technique for predictive process monitoring, which extends LORE, a recent explainable AI technique. We impose control flow constraints to the explanation generation process to ensure realistic counterfactual explanations. Moreover, we extend LORE to enable explaining multi-class classification models. Experimental results using a real, public dataset indicate that LORELEY can approximate the prediction models with an average fidelity of 97.69\% and generate realistic counterfactual explanations.  ( 2 min )
    Attainability and Optimality: The Equalized Odds Fairness Revisited. (arXiv:2202.11853v1 [cs.LG])
    Fairness of machine learning algorithms has been of increasing interest. In order to suppress or eliminate discrimination in prediction, various notions as well as approaches have been proposed to impose fairness. Given a notion of fairness, an essential problem is then whether or not it can always be attained, even if with an unlimited amount of data. This issue is, however, not well addressed yet. In this paper, focusing on the Equalized Odds notion of fairness, we consider the attainability of this criterion and, furthermore, if it is attainable, the optimality of the prediction performance under various settings. In particular, for prediction performed by a deterministic function of input features, we give conditions under which Equalized Odds can hold true; if the stochastic prediction is acceptable, we show that under mild assumptions, fair predictors can always be derived. For classification, we further prove that compared to enforcing fairness by post-processing, one can always benefit from exploiting all available features during training and get potentially better prediction performance while remaining fair. Moreover, while stochastic prediction can attain Equalized Odds with theoretical guarantees, we also discuss its limitation and potential negative social impacts.  ( 2 min )
    No-Regret Learning in Games is Turing Complete. (arXiv:2202.11871v1 [cs.GT])
    Games are natural models for multi-agent machine learning settings, such as generative adversarial networks (GANs). The desirable outcomes from algorithmic interactions in these games are encoded as game theoretic equilibrium concepts, e.g. Nash and coarse correlated equilibria. As directly computing an equilibrium is typically impractical, one often aims to design learning algorithms that iteratively converge to equilibria. A growing body of negative results casts doubt on this goal, from non-convergence to chaotic and even arbitrary behaviour. In this paper we add a strong negative result to this list: learning in games is Turing complete. Specifically, we prove Turing completeness of the replicator dynamic on matrix games, one of the simplest possible settings. Our results imply the undecicability of reachability problems for learning algorithms in games, a special case of which is determining equilibrium convergence.  ( 2 min )
    Auto-scaling Vision Transformers without Training. (arXiv:2202.11921v1 [cs.LG])
    This work targets automated designing and scaling of Vision Transformers (ViTs). The motivation comes from two pain spots: 1) the lack of efficient and principled methods for designing and scaling ViTs; 2) the tremendous computational cost of training ViT that is much heavier than its convolution counterpart. To tackle these issues, we propose As-ViT, an auto-scaling framework for ViTs without training, which automatically discovers and scales up ViTs in an efficient and principled manner. Specifically, we first design a "seed" ViT topology by leveraging a training-free search process. This extremely fast search is fulfilled by a comprehensive study of ViT's network complexity, yielding a strong Kendall-tau correlation with ground-truth accuracies. Second, starting from the "seed" topology, we automate the scaling rule for ViTs by growing widths/depths to different ViT layers. This results in a series of architectures with different numbers of parameters in a single run. Finally, based on the observation that ViTs can tolerate coarse tokenization in early training stages, we propose a progressive tokenization strategy to train ViTs faster and cheaper. As a unified framework, As-ViT achieves strong performance on classification (83.5% top1 on ImageNet-1k) and detection (52.7% mAP on COCO) without any manual crafting nor scaling of ViT architectures: the end-to-end model design and scaling process cost only 12 hours on one V100 GPU. Our code is available at https://github.com/VITA-Group/AsViT.  ( 2 min )
    A Note on Machine Learning Approach for Computational Imaging. (arXiv:2202.11883v1 [eess.IV])
    Computational imaging has been playing a vital role in the development of natural sciences. Advances in sensory, information, and computer technologies have further extended the scope of influence of imaging, making digital images an essential component of our daily lives. For the past three decades, we have witnessed phenomenal developments of mathematical and machine learning methods in computational imaging. In this note, we will review some of the recent developments of the machine learning approach for computational imaging and discuss its differences and relations to the mathematical approach. We will demonstrate how we may combine the wisdom from both approaches, discuss the merits and potentials of such a combination and present some of the new computational and theoretical challenges it brings about.  ( 2 min )
    First is Better Than Last for Training Data Influence. (arXiv:2202.11844v1 [cs.LG])
    The ability to identify influential training examples enables us to debug training data and explain model behavior. Existing techniques are based on the flow of influence through the model parameters. For large models in NLP applications, it is often computationally infeasible to study this flow through all model parameters, therefore techniques usually pick the last layer of weights. Our first observation is that for classification problems, the last layer is reductive and does not encode sufficient input level information. Deleting influential examples, according to this measure, typically does not change the model's behavior much. We propose a technique called TracIn-WE that modifies a method called TracIn to operate on the word embedding layer instead of the last layer. This could potentially have the opposite concern, that the word embedding layer does not encode sufficient high level information. However, we find that gradients (unlike embeddings) do not suffer from this, possibly because they chain through higher layers. We show that TracIn-WE significantly outperforms other data influence methods applied on the last layer by 4-10 times on the case deletion evaluation on three language classification tasks. In addition, TracIn-WE can produce scores not just at the training data level, but at the word training data level, a further aid in debugging.  ( 2 min )
    Differentially Private Speaker Anonymization. (arXiv:2202.11823v1 [cs.SD])
    Sharing real-world speech utterances is key to the training and deployment of voice-based services. However, it also raises privacy risks as speech contains a wealth of personal data. Speaker anonymization aims to remove speaker information from a speech utterance while leaving its linguistic and prosodic attributes intact. State-of-the-art techniques operate by disentangling the speaker information (represented via a speaker embedding) from these attributes and re-synthesizing speech based on the speaker embedding of another speaker. Prior research in the privacy community has shown that anonymization often provides brittle privacy protection, even less so any provable guarantee. In this work, we show that disentanglement is indeed not perfect: linguistic and prosodic attributes still contain speaker information. We remove speaker information from these attributes by introducing differentially private feature extractors based on an autoencoder and an automatic speech recognizer, respectively, trained using noise layers. We plug these extractors in the state-of-the-art anonymization pipeline and generate, for the first time, differentially private utterances with a provable upper bound on the speaker information they contain. We evaluate empirically the privacy and utility resulting from our differentially private speaker anonymization approach on the LibriSpeech data set. Experimental results show that the generated utterances retain very high utility for automatic speech recognition training and inference, while being much better protected against strong adversaries who leverage the full knowledge of the anonymization process to try to infer the speaker identity.  ( 2 min )
    Consistent Dropout for Policy Gradient Reinforcement Learning. (arXiv:2202.11818v1 [cs.LG])
    Dropout has long been a staple of supervised learning, but is rarely used in reinforcement learning. We analyze why naive application of dropout is problematic for policy-gradient learning algorithms and introduce consistent dropout, a simple technique to address this instability. We demonstrate consistent dropout enables stable training with A2C and PPO in both continuous and discrete action environments across a wide range of dropout probabilities. Finally, we show that consistent dropout enables the online training of complex architectures such as GPT without needing to disable the model's native dropout.  ( 2 min )
    Robust Probabilistic Time Series Forecasting. (arXiv:2202.11910v1 [cs.LG])
    Probabilistic time series forecasting has played critical role in decision-making processes due to its capability to quantify uncertainties. Deep forecasting models, however, could be prone to input perturbations, and the notion of such perturbations, together with that of robustness, has not even been completely established in the regime of probabilistic forecasting. In this work, we propose a framework for robust probabilistic time series forecasting. First, we generalize the concept of adversarial input perturbations, based on which we formulate the concept of robustness in terms of bounded Wasserstein deviation. Then we extend the randomized smoothing technique to attain robust probabilistic forecasters with theoretical robustness certificates against certain classes of adversarial perturbations. Lastly, extensive experiments demonstrate that our methods are empirically effective in enhancing the forecast quality under additive adversarial attacks and forecast consistency under supplement of noisy observations.  ( 2 min )
    Analysis of Coronavirus Envelope Protein with Cellular Automata (CA) Model. (arXiv:2202.11752v1 [q-bio.BM])
    The reason of significantly higher transmissibility of SARS Covid (2019 CoV-2) compared to SARS Covid (2003 CoV) and MERS Covid (2012 MERS) can be attributed to mutations reported in structural proteins, and the role played by non-structural proteins (nsps) and accessory proteins (ORFs) for viral replication, assembly, and shedding. Envelope protein E is one of the four structural proteins of minimum length. Recent studies have confirmed critical role played by the envelope protein in the viral life cycle including assembly of virion exported from infected cell for its transmission. However, the determinants of the highly complex viral - host interactions of envelope protein, particularly with host Golgi complex, have not been adequately characterized. CoV-2 and CoV Envelope proteins of length 75 and 76 amino acids differ in four amino acid locations. The additional amino acid Gly (G) at location 70 makes CoV length 76. The amino acid pair EG at location 69-70 of CoV in place of amino acid R in location 69 of CoV-2, has been identified as a major determining factor in the current investigation. This paper concentrates on the design of computational model to compare the structure/function of wild and mutants of CoV-2 with wild and mutants of CoV in the functionally important region of the protein chain pair. We hypothesize that differences of CAML model parameter of CoV-2 and CoV characterize the deviation in structure and function of envelope proteins in respect of interaction of virus with host Golgi complex; and this difference gets reflected in the difference of their transmissibility. The hypothesis has been validated from single point mutational study on- (i) human HBB beta-globin hemoglobin protein associated with sickle cell anemia, (ii) mutants of envelope protein of Covid-2 infected patients reported in recent publications.  ( 3 min )
    Controlling Memorability of Face Images. (arXiv:2202.11896v1 [cs.CV])
    Everyday, we are bombarded with many photographs of faces, whether on social media, television, or smartphones. From an evolutionary perspective, faces are intended to be remembered, mainly due to survival and personal relevance. However, all these faces do not have the equal opportunity to stick in our minds. It has been shown that memorability is an intrinsic feature of an image but yet, it is largely unknown what attributes make an image more memorable. In this work, we aimed to address this question by proposing a fast approach to modify and control the memorability of face images. In our proposed method, we first found a hyperplane in the latent space of StyleGAN to separate high and low memorable images. We then modified the image memorability (while maintaining the identity and other facial features such as age, emotion, etc.) by moving in the positive or negative direction of this hyperplane normal vector. We further analyzed how different layers of the StyleGAN augmented latent space contribute to face memorability. These analyses showed how each individual face attribute makes an image more or less memorable. Most importantly, we evaluated our proposed method for both real and synthesized face images. The proposed method successfully modifies and controls the memorability of real human faces as well as unreal synthesized faces. Our proposed method can be employed in photograph editing applications for social media, learning aids, or advertisement purposes.  ( 2 min )
    The Need for Interpretable Features: Motivation and Taxonomy. (arXiv:2202.11748v1 [cs.LG])
    Through extensive experience developing and explaining machine learning (ML) applications for real-world domains, we have learned that ML models are only as interpretable as their features. Even simple, highly interpretable model types such as regression models can be difficult or impossible to understand if they use uninterpretable features. Different users, especially those using ML models for decision-making in their domains, may require different levels and types of feature interpretability. Furthermore, based on our experiences, we claim that the term "interpretable feature" is not specific nor detailed enough to capture the full extent to which features impact the usefulness of ML explanations. In this paper, we motivate and discuss three key lessons: 1) more attention should be given to what we refer to as the interpretable feature space, or the state of features that are useful to domain experts taking real-world actions, 2) a formal taxonomy is needed of the feature properties that may be required by these domain experts (we propose a partial taxonomy in this paper), and 3) transforms that take data from the model-ready state to an interpretable form are just as essential as traditional ML transforms that prepare features for the model.  ( 2 min )
    Nowcasting the Financial Time Series with Streaming Data Analytics under Apache Spark. (arXiv:2202.11820v1 [cs.LG])
    This paper proposes nowcasting of high-frequency financial datasets in real-time with a 5-minute interval using the streaming analytics feature of Apache Spark. The proposed 2 stage method consists of modelling chaos in the first stage and then using a sliding window approach for training with machine learning algorithms namely Lasso Regression, Ridge Regression, Generalised Linear Model, Gradient Boosting Tree and Random Forest available in the MLLib of Apache Spark in the second stage. For testing the effectiveness of the proposed methodology, 3 different datasets, of which two are stock markets namely National Stock Exchange & Bombay Stock Exchange, and finally One Bitcoin-INR conversion dataset. For evaluating the proposed methodology, we used metrics such as Symmetric Mean Absolute Percentage Error, Directional Symmetry, and Theil U Coefficient. We tested the significance of each pair of models using the Diebold Mariano (DM) test.  ( 2 min )
    Adversarially-regularized mixed effects deep learning (ARMED) models for improved interpretability, performance, and generalization on clustered data. (arXiv:2202.11783v1 [cs.LG])
    Data in the natural sciences frequently violate assumptions of independence. Such datasets have samples with inherent clustering (e.g. by study site, subject, experimental batch), which may lead to spurious associations, poor model fitting, and confounded analyses. While largely unaddressed in deep learning, mixed effects models have been used in traditional statistics for clustered data. Mixed effects models separate cluster-invariant, population-level fixed effects from cluster-specific random effects. We propose a general-purpose framework for building Adversarially-Regularized Mixed Effects Deep learning (ARMED) models through 3 non-intrusive additions to existing networks: 1) a domain adversarial classifier constraining the original model to learn only cluster-invariant features, 2) a random effects subnetwork capturing cluster-specific features, and 3) a cluster-inferencing approach to predict on clusters unseen during training. We apply this framework to dense feedforward neural networks (DFNNs), convolutional neural networks, and autoencoders on 4 applications including simulations, dementia prognosis and diagnosis, and cell microscopy. We compare to conventional models, domain adversarial-only models, and the naive inclusion of cluster membership as a covariate. Our models better distinguish confounded from true associations in simulations and emphasize more biologically plausible features in clinical applications. ARMED DFNNs quantify inter-cluster variance in clinical data while ARMED autoencoders visualize batch effects in cell images. Finally, ARMED improves accuracy on data from clusters seen during training (up to 28% vs. conventional models) and generalizes better to unseen clusters (up to 9% vs. conventional models). By incorporating powerful mixed effects modeling into deep learning, ARMED increases performance, interpretability, and generalization on clustered data.  ( 2 min )
    When do GANs replicate? On the choice of dataset size. (arXiv:2202.11765v1 [cs.LG])
    Do GANs replicate training images? Previous studies have shown that GANs do not seem to replicate training data without significant change in the training procedure. This leads to a series of research on the exact condition needed for GANs to overfit to the training data. Although a number of factors has been theoretically or empirically identified, the effect of dataset size and complexity on GANs replication is still unknown. With empirical evidence from BigGAN and StyleGAN2, on datasets CelebA, Flower and LSUN-bedroom, we show that dataset size and its complexity play an important role in GANs replication and perceptual quality of the generated images. We further quantify this relationship, discovering that replication percentage decays exponentially with respect to dataset size and complexity, with a shared decaying factor across GAN-dataset combinations. Meanwhile, the perceptual image quality follows a U-shape trend w.r.t dataset size. This finding leads to a practical tool for one-shot estimation on minimal dataset size to prevent GAN replication which can be used to guide datasets construction and selection.  ( 2 min )
    Robust Federated Learning with Connectivity Failures: A Semi-Decentralized Framework with Collaborative Relaying. (arXiv:2202.11850v1 [cs.DC])
    Intermittent client connectivity is one of the major challenges in centralized federated edge learning frameworks. Intermittently failing uplinks to the central parameter server (PS) can induce a large generalization gap in performance especially when the data distribution among the clients exhibits heterogeneity. In this work, to mitigate communication blockages between clients and the central PS, we introduce the concept of knowledge relaying wherein the successfully participating clients collaborate in relaying their neighbors' local updates to a central parameter server (PS) in order to boost the participation of clients with intermittently failing connectivity. We propose a collaborative relaying based semi-decentralized federated edge learning framework where at every communication round each client first computes a local consensus of the updates from its neighboring clients and eventually transmits a weighted average of its own update and those of its neighbors to the PS. We appropriately optimize these averaging weights to reduce the variance of the global update at the PS while ensuring that the global update is unbiased, consequently improving the convergence rate. Finally, by conducting experiments on CIFAR-10 dataset we validate our theoretical results and demonstrate that our proposed scheme is superior to Federated averaging benchmark especially when data distribution among clients is non-iid.  ( 2 min )
    A modification of the conjugate direction method for motion estimation. (arXiv:2202.11831v1 [cs.CV])
    A comparative study of different block matching alternatives for motion estimation is presented. The study is focused on computational burden and objective measures on the accuracy of prediction. Together with existing algorithms several new variations have been tested. An interesting modification of the conjugate direction method previously related in literature is reported. This new algorithm shows a good trade-off between computational complexity and accuracy of motion vector estimation. Computational complexity is evaluated using a sequence of artificial images designed to incorporate a great variety of motion vectors. The performance of block matching methods has been measured in terms of the entropy in the error signal between the motion compensated and the original frames.  ( 2 min )
    Benefit of Interpolation in Nearest Neighbor Algorithms. (arXiv:2202.11817v1 [stat.ML])
    In some studies \citep[e.g.,][]{zhang2016understanding} of deep learning, it is observed that over-parametrized deep neural networks achieve a small testing error even when the training error is almost zero. Despite numerous works towards understanding this so-called "double descent" phenomenon \citep[e.g.,][]{belkin2018reconciling,belkin2019two}, in this paper, we turn into another way to enforce zero training error (without over-parametrization) through a data interpolation mechanism. Specifically, we consider a class of interpolated weighting schemes in the nearest neighbors (NN) algorithms. By carefully characterizing the multiplicative constant in the statistical risk, we reveal a U-shaped performance curve for the level of data interpolation in both classification and regression setups. This sharpens the existing result \citep{belkin2018does} that zero training error does not necessarily jeopardize predictive performances and claims a counter-intuitive result that a mild degree of data interpolation actually {\em strictly} improve the prediction performance and statistical stability over those of the (un-interpolated) $k$-NN algorithm. In the end, the universality of our results, such as change of distance measure and corrupted testing data, will also be discussed.  ( 2 min )
    Exploiting Correlation to Achieve Faster Learning Rates in Low-Rank Preference Bandits. (arXiv:2202.11795v1 [cs.LG])
    We introduce the \emph{Correlated Preference Bandits} problem with random utility-based choice models (RUMs), where the goal is to identify the best item from a given pool of $n$ items through online subsetwise preference feedback. We investigate whether models with a simple correlation structure, e.g. low rank, can result in faster learning rates. While we show that the problem can be impossible to solve for the general `low rank' choice models, faster learning rates can be attained assuming more structured item correlations. In particular, we introduce a new class of \emph{Block-Rank} based RUM model, where the best item is shown to be $(\epsilon,\delta)$-PAC learnable with only $O(r \epsilon^{-2} \log(n/\delta))$ samples. This improves on the standard sample complexity bound of $\tilde{O}(n\epsilon^{-2} \log(1/\delta))$ known for the usual learning algorithms which might not exploit the item-correlations ($r \ll n$). We complement the above sample complexity with a matching lower bound (up to logarithmic factors), justifying the tightness of our analysis. Surprisingly, we also show a lower bound of $\Omega(n\epsilon^{-2}\log(1/\delta))$ when the learner is forced to play just duels instead of larger subsetwise queries. Further, we extend the results to a more general `\emph{noisy Block-Rank}' model, which ensures robustness of our techniques. Overall, our results justify the advantage of playing subsetwise queries over pairwise preferences $(k=2)$, we show the latter provably fails to exploit correlation.  ( 2 min )
    Generative modeling via tensor train sketching. (arXiv:2202.11788v1 [math.NA])
    In this paper we introduce a sketching algorithm for constructing a tensor train representation of a probability density from its samples. Our method deviates from the standard recursive SVD-based procedure for constructing a tensor train. Instead we formulate and solve a sequence of small linear systems for the individual tensor train cores. This approach can avoid the curse of dimensionality that threatens both the algorithmic and sample complexities of the recovery problem. Specifically, for Markov models, we prove that the tensor cores can be recovered with a sample complexity that is constant with respect to the dimension. Finally, we illustrate the performance of the method with several numerical experiments.  ( 2 min )
    Flow-based sampling in the lattice Schwinger model at criticality. (arXiv:2202.11712v1 [hep-lat])
    Recent results suggest that flow-based algorithms may provide efficient sampling of field distributions for lattice field theory applications, such as studies of quantum chromodynamics and the Schwinger model. In this work, we provide a numerical demonstration of robust flow-based sampling in the Schwinger model at the critical value of the fermion mass. In contrast, at the same parameters, conventional methods fail to sample all parts of configuration space, leading to severely underestimated uncertainties.  ( 2 min )
    An Efficient Binary Harris Hawks Optimization based on Quantum SVM for Cancer Classification Tasks. (arXiv:2202.11899v1 [cs.LG])
    Cancer classification based on gene expression increases early diagnosis and recovery, but high-dimensional genes with a small number of samples are a major challenge. This work introduces a new hybrid quantum kernel support vector machine (QKSVM) combined with a Binary Harris hawk optimization (BHHO) based gene selection for cancer classification on a quantum simulator. This study aims to improve the microarray cancer prediction performance with the quantum kernel estimation based on the informative genes by BHHO. The feature selection is a critical step in large-dimensional features, and BHHO is used to select important features. The BHHO mimics the behavior of the cooperative action of Harris hawks in nature. The principal component analysis (PCA) is applied to reduce the selected genes to match the qubit numbers. After which, the quantum computer is used to estimate the kernel with the training data of the reduced genes and generate the quantum kernel matrix. Moreover, the classical computer is used to draw the support vectors based on the quantum kernel matrix. Also, the prediction stage is performed with the classical device. Finally, the proposed approach is applied to colon and breast microarray datasets and evaluated with all genes and the selected genes by BHHO. The proposed approach is found to enhance the overall performance with two datasets. Also, the proposed approach is evaluated with different quantum feature maps (kernels) and classical kernel (RBF).  ( 2 min )
    NeuroView-RNN: It's About Time. (arXiv:2202.11811v1 [cs.LG])
    Recurrent Neural Networks (RNNs) are important tools for processing sequential data such as time-series or video. Interpretability is defined as the ability to be understood by a person and is different from explainability, which is the ability to be explained in a mathematical formulation. A key interpretability issue with RNNs is that it is not clear how each hidden state per time step contributes to the decision-making process in a quantitative manner. We propose NeuroView-RNN as a family of new RNN architectures that explains how all the time steps are used for the decision-making process. Each member of the family is derived from a standard RNN architecture by concatenation of the hidden steps into a global linear classifier. The global linear classifier has all the hidden states as the input, so the weights of the classifier have a linear mapping to the hidden states. Hence, from the weights, NeuroView-RNN can quantify how important each time step is to a particular decision. As a bonus, NeuroView-RNN also offers higher accuracy in many cases compared to the RNNs and their variants. We showcase the benefits of NeuroView-RNN by evaluating on a multitude of diverse time-series datasets.  ( 2 min )
    Using Deep Learning to Detect Digitally Encoded DNA Trigger for Trojan Malware in Bio-Cyber Attacks. (arXiv:2202.11824v1 [cs.CR])
    This article uses Deep Learning technologies to safeguard DNA sequencing against Bio-Cyber attacks. We consider a hybrid attack scenario where the payload is encoded into a DNA sequence to activate a Trojan malware implanted in a software tool used in the sequencing pipeline in order to allow the perpetrators to gain control over the resources used in that pipeline during sequence analysis. The scenario considered in the paper is based on perpetrators submitting synthetically engineered DNA samples that contain digitally encoded IP address and port number of the perpetrators machine in the DNA. Genetic analysis of the samples DNA will decode the address that is used by the software trojan malware to activate and trigger a remote connection. This approach can open up to multiple perpetrators to create connections to hijack the DNA sequencing pipeline. As a way of hiding the data, the perpetrators can avoid detection by encoding the address to maximise similarity with genuine DNAs, which we showed previously. However, in this paper we show how Deep Learning can be used to successfully detect and identify the trigger encoded data, in order to protect a DNA sequencing pipeline from trojan attacks. The result shows nearly up to 100% accuracy in detection in such a novel Trojan attack scenario even after applying fragmentation encryption and steganography on the encoded trigger data. In addition, feasibility of designing and synthesizing encoded DNA for such Trojan payloads is validated by a wet lab experiment.  ( 2 min )
    Investigating the effect of binning on causal discovery. (arXiv:2202.11789v1 [cs.LG])
    Binning (a.k.a. discretization) of numerically continuous measurements is a wide-spread but controversial practice in data collection, analysis, and presentation. The consequences of binning have been evaluated for many different kinds of data analysis methods, however so far the effect of binning on causal discovery algorithms has not been directly investigated. This paper reports the results of a simulation study that examined the effect of binning on the Greedy Equivalence Search (GES) causal discovery algorithm. Our findings suggest that unbinned continuous data often result in the highest search performance, but some exceptions are identified. We also found that binned data are more sensitive to changes in sample size and tuning parameters, and identified some interactive effects between sample size, binning, and tuning parameter on performance.  ( 2 min )
    Single Image Super-Resolution Methods: A Survey. (arXiv:2202.11763v1 [eess.IV])
    Super-resolution (SR), the process of obtaining high-resolution images from one or more low-resolution observations of the same scene, has been a very popular topic of research in the last few decades in both signal processing and image processing areas. Due to the recent developments in Convolutional Neural Networks, the popularity of SR algorithms has skyrocketed as the barrier of entry has been lowered significantly. Recently, this popularity has spread into video processing areas to the lengths of developing SR models that work in real-time. In this paper, we compare different SR models that specialize in single image processing and will take a glance at how they evolved to take on many different objectives and shapes over the years.  ( 2 min )
    Loss as the Inconsistency of a Probabilistic Dependency Graph: Choose Your Model, Not Your Loss Function. (arXiv:2202.11862v1 [cs.LG])
    In a world blessed with a great diversity of loss functions, we argue that that choice between them is not a matter of taste or pragmatics, but of model. Probabilistic depencency graphs (PDGs) are probabilistic models that come equipped with a measure of "inconsistency". We prove that many standard loss functions arise as the inconsistency of a natural PDG describing the appropriate scenario, and use the same approach to justify a well-known connection between regularizers and priors. We also show that the PDG inconsistency captures a large class of statistical divergences, and detail benefits of thinking of them in this way, including an intuitive visual language for deriving inequalities between them. In variational inference, we find that the ELBO, a somewhat opaque objective for latent variable models, and variants of it arise for free out of uncontroversial modeling assumptions -- as do simple graphical proofs of their corresponding bounds. Finally, we observe that inconsistency becomes the log partition function (free energy) in the setting where PDGs are factor graphs.  ( 2 min )
    Truncated LinUCB for Stochastic Linear Bandits. (arXiv:2202.11735v1 [stat.ML])
    This paper considers contextual bandits with a finite number of arms, where the contexts are independent and identically distributed $d$-dimensional random vectors, and the expected rewards are linear in both the arm parameters and contexts. The LinUCB algorithm, which is near minimax optimal for related linear bandits, is shown to have a cumulative regret that is suboptimal in both the dimension $d$ and time horizon $T$, due to its over-exploration. A truncated version of LinUCB is proposed and termed "Tr-LinUCB", which follows LinUCB up to a truncation time $S$ and performs pure exploitation afterwards. The Tr-LinUCB algorithm is shown to achieve $O(d\log(T))$ regret if $S = Cd\log(T)$ for a sufficiently large constant $C$, and a matching lower bound is established, which shows the rate optimality of Tr-LinUCB in both $d$ and $T$ under a low dimensional regime. Further, if $S = d\log^{\kappa}(T)$ for some $\kappa>1$, the loss compared to the optimal is a multiplicative $\log\log(T)$ factor, which does not depend on $d$. This insensitivity to overshooting in choosing the truncation time of Tr-LinUCB is of practical importance.  ( 2 min )
    Training Characteristic Functions with Reinforcement Learning: XAI-methods play Connect Four. (arXiv:2202.11797v1 [cs.LG])
    One of the goals of Explainable AI (XAI) is to determine which input components were relevant for a classifier decision. This is commonly know as saliency attribution. Characteristic functions (from cooperative game theory) are able to evaluate partial inputs and form the basis for theoretically "fair" attribution methods like Shapley values. Given only a standard classifier function, it is unclear how partial input should be realised. Instead, most XAI-methods for black-box classifiers like neural networks consider counterfactual inputs that generally lie off-manifold. This makes them hard to evaluate and easy to manipulate. We propose a setup to directly train characteristic functions in the form of neural networks to play simple two-player games. We apply this to the game of Connect Four by randomly hiding colour information from our agents during training. This has three advantages for comparing XAI-methods: It alleviates the ambiguity about how to realise partial input, makes off-manifold evaluation unnecessary and allows us to compare the methods by letting them play against each other.  ( 2 min )
    Discovering Multiple and Diverse Directions for Cognitive Image Properties. (arXiv:2202.11772v1 [cs.CV])
    Recent research has shown that it is possible to find interpretable directions in the latent spaces of pre-trained GANs. These directions enable controllable generation and support a variety of semantic editing operations. While previous work has focused on discovering a single direction that performs a desired editing operation such as zoom-in, limited work has been done on the discovery of multiple and diverse directions that can achieve the desired edit. In this work, we propose a novel framework that discovers multiple and diverse directions for a given property of interest. In particular, we focus on the manipulation of cognitive properties such as Memorability, Emotional Valence and Aesthetics. We show with extensive experiments that our method successfully manipulates these properties while producing diverse outputs. Our project page and source code can be found at this http URL  ( 2 min )
    ML-based Anomaly Detection in Optical Fiber Monitoring. (arXiv:2202.11756v1 [cs.CR])
    Secure and reliable data communication in optical networks is critical for high-speed internet. We propose a data driven approach for the anomaly detection and faults identification in optical networks to diagnose physical attacks such as fiber breaks and optical tapping. The proposed methods include an autoencoder-based anomaly detection and an attention-based bidirectional gated recurrent unit algorithm for the fiber fault identification and localization. We verify the efficiency of our methods by experiments under various attack scenarios using real operational data.  ( 2 min )
    Are All Linear Regions Created Equal?. (arXiv:2202.11749v1 [cs.LG])
    The number of linear regions has been studied as a proxy of complexity for ReLU networks. However, the empirical success of network compression techniques like pruning and knowledge distillation, suggest that in the overparameterized setting, linear regions density might fail to capture the effective nonlinearity. In this work, we propose an efficient algorithm for discovering linear regions and use it to investigate the effectiveness of density in capturing the nonlinearity of trained VGGs and ResNets on CIFAR-10 and CIFAR-100. We contrast the results with a more principled nonlinearity measure based on function variation, highlighting the shortcomings of linear regions density. Furthermore, interestingly, our measure of nonlinearity clearly correlates with model-wise deep double descent, connecting reduced test error with reduced nonlinearity, and increased local similarity of linear regions.  ( 2 min )
    Using Bayesian Deep Learning to infer Planet Mass from Gaps in Protoplanetary Disks. (arXiv:2202.11730v1 [astro-ph.EP])
    Planet induced sub-structures, like annular gaps, observed in dust emission from protoplanetary disks provide a unique probe to characterize unseen young planets. While deep learning based model has an edge in characterizing the planet's properties over traditional methods, like customized simulations and empirical relations, it lacks in its ability to quantify the uncertainty associated with its predictions. In this paper, we introduce a Bayesian deep learning network "DPNNet-Bayesian" that can predict planet mass from disk gaps and provides uncertainties associated with the prediction. A unique feature of our approach is that it can distinguish between the uncertainty associated with the deep learning architecture and uncertainty inherent in the input data due to measurement noise. The model is trained on a data set generated from disk-planet simulations using the \textsc{fargo3d} hydrodynamics code with a newly implemented fixed grain size module and improved initial conditions. The Bayesian framework enables estimating a gauge/confidence interval over the validity of the prediction when applied to unknown observations. As a proof-of-concept, we apply DPNNet-Bayesian to dust gaps observed in HL Tau. The network predicts masses of $ 86.0 \pm 5.5 M_{\Earth} $, $ 43.8 \pm 3.3 M_{\Earth} $, and $ 92.2 \pm 5.1 M_{\Earth} $ respectively, which are comparable to other studies based on specialized simulations.  ( 2 min )
    Completely Quantum Neural Networks. (arXiv:2202.11727v1 [quant-ph])
    Artificial neural networks are at the heart of modern deep learning algorithms. We describe how to embed and train a general neural network in a quantum annealer without introducing any classical element in training. To implement the network on a state-of-the-art quantum annealer, we develop three crucial ingredients: binary encoding the free parameters of the network, polynomial approximation of the activation function, and reduction of binary higher-order polynomials into quadratic ones. Together, these ideas allow encoding the loss function as an Ising model Hamiltonian. The quantum annealer then trains the network by finding the ground state. We implement this for an elementary network and illustrate the advantages of quantum training: its consistency in finding the global minimum of the loss function and the fact that the network training converges in a single annealing step, which leads to short training times while maintaining a high classification performance. Our approach opens a novel avenue for the quantum training of general machine learning models.  ( 2 min )
    Prune and Tune Ensembles: Low-Cost Ensemble Learning With Sparse Independent Subnetworks. (arXiv:2202.11782v1 [cs.LG])
    Ensemble Learning is an effective method for improving generalization in machine learning. However, as state-of-the-art neural networks grow larger, the computational cost associated with training several independent networks becomes expensive. We introduce a fast, low-cost method for creating diverse ensembles of neural networks without needing to train multiple models from scratch. We do this by first training a single parent network. We then create child networks by cloning the parent and dramatically pruning the parameters of each child to create an ensemble of members with unique and diverse topologies. We then briefly train each child network for a small number of epochs, which now converge significantly faster when compared to training from scratch. We explore various ways to maximize diversity in the child networks, including the use of anti-random pruning and one-cycle tuning. This diversity enables "Prune and Tune" ensembles to achieve results that are competitive with traditional ensembles at a fraction of the training cost. We benchmark our approach against state of the art low-cost ensemble methods and display marked improvement in both accuracy and uncertainty estimation on CIFAR-10 and CIFAR-100.  ( 2 min )
  • Open

    Sobolev Norm Learning Rates for Conditional Mean Embeddings. (arXiv:2105.07446v3 [stat.ML] UPDATED)
    We develop novel learning rates for conditional mean embeddings by applying the theory of interpolation for reproducing kernel Hilbert spaces (RKHS). We derive explicit, adaptive convergence rates for the sample estimator under the misspecifed setting, where the target operator is not Hilbert-Schmidt or bounded with respect to the input/output RKHSs. We demonstrate that in certain parameter regimes, we can achieve uniform convergence rates in the output RKHS. We hope our analyses will allow the much broader application of conditional mean embeddings to more complex ML/RL settings involving infinite dimensional RKHSs and continuous state spaces.  ( 2 min )
    Entropic Gromov-Wasserstein between Gaussian Distributions. (arXiv:2108.10961v3 [math.ST] UPDATED)
    We study the entropic Gromov-Wasserstein and its unbalanced version between (unbalanced) Gaussian distributions with different dimensions. When the metric is the inner product, which we refer to as inner product Gromov-Wasserstein (IGW), we demonstrate that the optimal transportation plans of entropic IGW and its unbalanced variant are (unbalanced) Gaussian distributions. Via an application of von Neumann's trace inequality, we obtain closed-form expressions for the entropic IGW between these Gaussian distributions. Finally, we consider an entropic inner product Gromov-Wasserstein barycenter of multiple Gaussian distributions. We prove that the barycenter is a Gaussian distribution when the entropic regularization parameter is small. We further derive a closed-form expression for the covariance matrix of the barycenter.  ( 2 min )
    Conditionally Gaussian PAC-Bayes. (arXiv:2110.11886v2 [cs.LG] UPDATED)
    Recent studies have empirically investigated different methods to train stochastic neural networks on a classification task by optimising a PAC-Bayesian bound via stochastic gradient descent. Most of these procedures need to replace the misclassification error with a surrogate loss, leading to a mismatch between the optimisation objective and the actual generalisation bound. The present paper proposes a novel training algorithm that optimises the PAC-Bayesian bound, without relying on any surrogate loss. Empirical results show that this approach outperforms currently available PAC-Bayesian training methods.  ( 2 min )
    MLDemon: Deployment Monitoring for Machine Learning Systems. (arXiv:2104.13621v5 [cs.LG] UPDATED)
    Post-deployment monitoring of ML systems is critical for ensuring reliability, especially as new user inputs can differ from the training distribution. Here we propose a novel approach, MLDemon, for ML DEployment MONitoring. MLDemon integrates both unlabeled data and a small amount of on-demand labels to produce a real-time estimate of the ML model's current performance on a given data stream. Subject to budget constraints, MLDemon decides when to acquire additional, potentially costly, expert supervised labels to verify the model. On temporal datasets with diverse distribution drifts and models, MLDemon outperforms existing approaches. Moreover, we provide theoretical analysis to show that MLDemon is minimax rate optimal for a broad class of distribution drifts.  ( 2 min )
    Disentanglement via Mechanism Sparsity Regularization: A New Principle for Nonlinear ICA. (arXiv:2107.10098v3 [stat.ML] UPDATED)
    This work introduces a novel principle we call disentanglement via mechanism sparsity regularization, which can be applied when the latent factors of interest depend sparsely on past latent factors and/or observed auxiliary variables. We propose a representation learning method that induces disentanglement by simultaneously learning the latent factors and the sparse causal graphical model that relates them. We develop a rigorous identifiability theory, building on recent nonlinear independent component analysis (ICA) results, that formalizes this principle and shows how the latent variables can be recovered up to permutation if one regularizes the latent mechanisms to be sparse and if some graph connectivity criterion is satisfied by the data generating process. As a special case of our framework, we show how one can leverage unknown-target interventions on the latent factors to disentangle them, thereby drawing further connections between ICA and causality. We propose a VAE-based method in which the latent mechanisms are learned and regularized via binary masks, and validate our theory by showing it learns disentangled representations in simulations.  ( 2 min )
    Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm. (arXiv:2102.12238v3 [cs.LG] UPDATED)
    We provide a function space characterization of the inductive bias resulting from minimizing the $\ell_2$ norm of the weights in multi-channel convolutional neural networks with linear activations and empirically test our resulting hypothesis on ReLU networks trained using gradient descent. We define an induced regularizer in the function space as the minimum $\ell_2$ norm of weights of a network required to realize a function. For two layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network are single channeled, the induced regularizer for any $K$ is independent of the number of output channels $C$. Furthermore, we derive the regularizer is a norm given by a semidefinite program (SDP). (b) In contrast, for multi-channel inputs, multiple output channels can be necessary to merely realize all matrix-valued linear functions and thus the inductive bias does depend on $C$. However, for sufficiently large $C$, the induced regularizer is again given by an SDP that is independent of $C$. In particular, the induced regularizer for $K=1$ and $K=D$ (input dimension) are given in closed form as the nuclear norm and the $\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients of the linear predictor. We investigate the broader applicability of our theoretical results to implicit regularization from gradient descent on linear and ReLU networks through experiments on MNIST and CIFAR-10 datasets.  ( 2 min )
    On Multimarginal Partial Optimal Transport: Equivalent Forms and Computational Complexity. (arXiv:2108.07992v2 [stat.ML] UPDATED)
    We study the multi-marginal partial optimal transport (POT) problem between $m$ discrete (unbalanced) measures with at most $n$ supports. We first prove that we can obtain two equivalence forms of the multimarginal POT problem in terms of the multimarginal optimal transport problem via novel extensions of cost tensor. The first equivalence form is derived under the assumptions that the total masses of each measure are sufficiently close while the second equivalence form does not require any conditions on these masses but at the price of more sophisticated extended cost tensor. Our proof techniques for obtaining these equivalence forms rely on novel procedures of moving mass in graph theory to push transportation plan into appropriate regions. Finally, based on the equivalence forms, we develop optimization algorithm, named ApproxMPOT algorithm, that builds upon the Sinkhorn algorithm for solving the entropic regularized multimarginal optimal transport. We demonstrate that the ApproxMPOT algorithm can approximate the optimal value of multimarginal POT problem with a computational complexity upper bound of the order $\tilde{\mathcal{O}}(m^3(n+1)^{m}/ \varepsilon^2)$ where $\varepsilon > 0$ stands for the desired tolerance.  ( 2 min )
    Separation of Scales and a Thermodynamic Description of Feature Learning in Some CNNs. (arXiv:2112.15383v2 [stat.ML] UPDATED)
    Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Their scale and complexity, often involving billions of inter-dependent internal degrees of freedom, renders direct microscopic analysis difficult. Under such circumstances, a common strategy is to identify slow degrees of freedom that average out the erratic behavior of the underlying fast microscopic variables. Here, we identify such a separation of scales occurring in fully trained over-parameterized deep convolutional neural networks (CNNs). Specifically, we show that DNN layers couple only through the second moment (kernels) of their activations and pre-activations. Moreover, in various settings, the latter fluctuate in a nearly Gaussian manner. For CNNs with infinitely many channels, these kernels are inert, while for finite CNNs they adapt to the data. In several deep non-linear CNN models trained on real data, the resulting thermodynamic theory of deep learning yields accurate predictions. In addition, it provides new ways of analyzing and understanding CNNs, and DNNs in general.  ( 2 min )
    Dynamic Defense Against Byzantine Poisoning Attacks in Federated Learning. (arXiv:2007.15030v2 [cs.LG] UPDATED)
    Federated learning, as a distributed learning that conducts the training on the local devices without accessing to the training data, is vulnerable to Byzatine poisoning adversarial attacks. We argue that the federated learning model has to avoid those kind of adversarial attacks through filtering out the adversarial clients by means of the federated aggregation operator. We propose a dynamic federated aggregation operator that dynamically discards those adversarial clients and allows to prevent the corruption of the global learning model. We assess it as a defense against adversarial attacks deploying a deep learning classification model in a federated learning setting on the Fed-EMNIST Digits, Fashion MNIST and CIFAR-10 image datasets. The results show that the dynamic selection of the clients to aggregate enhances the performance of the global learning model and discards the adversarial and poor (with low quality models) clients.  ( 2 min )
    Functional Classification of Bitcoin Addresses. (arXiv:2202.12019v1 [stat.AP])
    This paper proposes a classification model for predicting the main activity of bitcoin addresses based on their balances. Since the balances are functions of time, we apply methods from functional data analysis; more specifically, the features of the proposed classification model are the functional principal components of the data. Classifying bitcoin addresses is a relevant problem for two main reasons: to understand the composition of the bitcoin market, and to identify accounts used for illicit activities. Although other bitcoin classifiers have been proposed, they focus primarily on network analysis rather than curve behavior. Our approach, on the other hand, does not require any network information for prediction. Furthermore, functional features have the advantage of being straightforward to build, unlike expert-built features. Results show improvement when combining functional features with scalar features, and similar accuracy for the models using those features separately, which points to the functional model being a good alternative when domain-specific knowledge is not available.  ( 2 min )
    Multiscale Generative Models: Improving Performance of a Generative Model Using Feedback from Other Dependent Generative Models. (arXiv:2201.09644v2 [cs.LG] UPDATED)
    Realistic fine-grained multi-agent simulation of real-world complex systems is crucial for many downstream tasks such as reinforcement learning. Recent work has used generative models (GANs in particular) for providing high-fidelity simulation of real-world systems. However, such generative models are often monolithic and miss out on modeling the interaction in multi-agent systems. In this work, we take a first step towards building multiple interacting generative models (GANs) that reflects the interaction in real world. We build and analyze a hierarchical set-up where a higher-level GAN is conditioned on the output of multiple lower-level GANs. We present a technique of using feedback from the higher-level GAN to improve performance of lower-level GANs. We mathematically characterize the conditions under which our technique is impactful, including understanding the transfer learning nature of our set-up. We present three distinct experiments on synthetic data, time series data, and image domain, revealing the wide applicability of our technique.  ( 2 min )
    Resampling Base Distributions of Normalizing Flows. (arXiv:2110.15828v2 [stat.ML] UPDATED)
    Normalizing flows are a popular class of models for approximating probability distributions. However, their invertible nature limits their ability to model target distributions whose support have a complex topological structure, such as Boltzmann distributions. Several procedures have been proposed to solve this problem but many of them sacrifice invertibility and, thereby, tractability of the log-likelihood as well as other desirable properties. To address these limitations, we introduce a base distribution for normalizing flows based on learned rejection sampling, allowing the resulting normalizing flow to model complicated distributions without giving up bijectivity. Furthermore, we develop suitable learning algorithms using both maximizing the log-likelihood and the optimization of the Kullback-Leibler divergence, and apply them to various sample problems, i.e. approximating 2D densities, density estimation of tabular data, image generation, and modeling Boltzmann distributions. In these experiments our method is competitive with or outperforms the baselines.  ( 2 min )
    Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization. (arXiv:2111.00705v2 [cs.LG] UPDATED)
    Due to the explosion in the size of the training datasets, distributed learning has received growing interest in recent years. One of the major bottlenecks is the large communication cost between the central server and the local workers. While error feedback compression has been proven to be successful in reducing communication costs with stochastic gradient descent (SGD), there are much fewer attempts in building communication-efficient adaptive gradient methods with provable guarantees, which are widely used in training large-scale machine learning models. In this paper, we propose a new communication-compressed AMSGrad for distributed nonconvex optimization problem, which is provably efficient. Our proposed distributed learning framework features an effective gradient compression strategy and a worker-side model update design. We prove that the proposed communication-efficient distributed adaptive gradient method converges to the first-order stationary point with the same iteration complexity as uncompressed vanilla AMSGrad in the stochastic nonconvex optimization setting. Experiments on various benchmarks back up our theory.  ( 2 min )
    A Simple and General Debiased Machine Learning Theorem with Finite Sample Guarantees. (arXiv:2105.15197v2 [stat.ML] UPDATED)
    Debiased machine learning is a meta algorithm based on bias correction and sample splitting to calculate confidence intervals for functionals (i.e. scalar summaries) of machine learning algorithms. For example, an analyst may desire the confidence interval for a treatment effect estimated with a neural network. We provide a nonasymptotic debiased machine learning theorem that encompasses any global or local functional of any machine learning algorithm that satisfies a few simple, interpretable conditions. Formally, we prove consistency, Gaussian approximation, and semiparametric efficiency by finite sample arguments. The rate of convergence is $n^{-1/2}$ for global functionals, and it degrades gracefully for local functionals. Our results culminate in a simple set of conditions that an analyst can use to translate modern learning theory rates into traditional statistical inference. The conditions reveal a general double robustness property for ill posed inverse problems.  ( 2 min )
    Neighborhood Region Smoothing Regularization for Finding Flat Minima In Deep Neural Networks. (arXiv:2201.06064v2 [cs.LG] UPDATED)
    Due to diverse architectures in deep neural networks (DNNs) with severe overparameterization, regularization techniques are critical for finding optimal solutions in the huge hypothesis space. In this paper, we propose an effective regularization technique, called Neighborhood Region Smoothing (NRS). NRS leverages the finding that models would benefit from converging to flat minima, and tries to regularize the neighborhood region in weight space to yield approximate outputs. Specifically, gap between outputs of models in the neighborhood region is gauged by a defined metric based on Kullback-Leibler divergence. This metric provides similar insights with the minimum description length principle on interpreting flat minima. By minimizing both this divergence and empirical loss, NRS could explicitly drive the optimizer towards converging to flat minima. We confirm the effectiveness of NRS by performing image classification tasks across a wide range of model architectures on commonly-used datasets such as CIFAR and ImageNet, where generalization ability could be universally improved. Also, we empirically show that the minima found by NRS would have relatively smaller Hessian eigenvalues compared to the conventional method, which is considered as the evidence of flat minima.  ( 2 min )
    Robust Deep Learning from Crowds with Belief Propagation. (arXiv:2111.00734v2 [cs.LG] UPDATED)
    Crowdsourcing systems enable us to collect large-scale dataset, but inherently suffer from noisy labels of low-paid workers. We address the inference and learning problems using such a crowdsourced dataset with noise. Due to the nature of sparsity in crowdsourcing, it is critical to exploit both probabilistic model to capture worker prior and neural network to extract task feature despite risks from wrong prior and overfitted feature in practice. We hence establish a neural-powered Bayesian framework, from which we devise deepMF and deepBP with different choice of variational approximation methods, mean field (MF) and belief propagation (BP), respectively. This provides a unified view of existing methods, which are special cases of deepMF with different priors. In addition, our empirical study suggests that deepBP is a new approach, which is more robust against wrong prior, feature overfitting and extreme workers thanks to the more sophisticated BP than MF.  ( 2 min )
    Double Control Variates for Gradient Estimation in Discrete Latent Variable Models. (arXiv:2111.05300v2 [stat.ML] UPDATED)
    Stochastic gradient-based optimisation for discrete latent variable models is challenging due to the high variance of gradients. We introduce a variance reduction technique for score function estimators that makes use of double control variates. These control variates act on top of a main control variate, and try to further reduce the variance of the overall estimator. We develop a double control variate for the REINFORCE leave-one-out estimator using Taylor expansions. For training discrete latent variable models, such as variational autoencoders with binary latent variables, our approach adds no extra computational cost compared to standard training with the REINFORCE leave-one-out estimator. We apply our method to challenging high-dimensional toy examples and training variational autoencoders with binary latent variables. We show that our estimator can have lower variance compared to other state-of-the-art estimators.  ( 2 min )
    Fast and Scalable Spike and Slab Variable Selection in High-Dimensional Gaussian Processes. (arXiv:2111.04558v2 [stat.ML] UPDATED)
    Variable selection in Gaussian processes (GPs) is typically undertaken by thresholding the inverse lengthscales of automatic relevance determination kernels, but in high-dimensional datasets this approach can be unreliable. A more probabilistically principled alternative is to use spike and slab priors and infer a posterior probability of variable inclusion. However, existing implementations in GPs are very costly to run in both high-dimensional and large-$n$ datasets, or are only suitable for unsupervised settings with specific kernels. As such, we develop a fast and scalable variational inference algorithm for the spike and slab GP that is tractable with arbitrary differentiable kernels. We improve our algorithm's ability to adapt to the sparsity of relevant variables by Bayesian model averaging over hyperparameters, and achieve substantial speed ups using zero temperature posterior restrictions, dropout pruning and nearest neighbour minibatching. In experiments our method consistently outperforms vanilla and sparse variational GPs whilst retaining similar runtimes (even when $n=10^6$) and performs competitively with a spike and slab GP using MCMC but runs up to $1000$ times faster.  ( 2 min )
    Attentive Temporal Pooling for Conformer-based Streaming Language Identification in Long-form Speech. (arXiv:2202.12163v1 [eess.AS])
    In this paper, we introduce a novel language identification system based on conformer layers. We propose an attentive temporal pooling mechanism to allow the model to carry information in long-form audio via a recurrent form, such that the inference can be performed in a streaming fashion. Additionally, a simple domain adaptation mechanism is introduced to allow adapting an existing language identification model to a new domain where the prior language distribution is different. We perform a comparative study of different model topologies under different constraints of model size, and find that conformer-base models outperform LSTM and transformer based models. Our experiments also show that attentive temporal pooling and domain adaptation significantly improve the model accuracy.  ( 2 min )
    Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. (arXiv:1912.10754v3 [math.ST] UPDATED)
    We consider random-design linear prediction and related questions on the lower tail of random matrices. It is known that, under boundedness constraints, the minimax risk is of order $d/n$ in dimension $d$ with $n$ samples. Here, we study the minimax expected excess risk over the full linear class, depending on the distribution of covariates. First, the least squares estimator is exactly minimax optimal in the well-specified case, for every distribution of covariates. We express the minimax risk in terms of the distribution of statistical leverage scores of individual samples, and deduce a minimax lower bound of $d/(n-d+1)$ for any covariate distribution, nearly matching the risk for Gaussian design. We then obtain sharp nonasymptotic upper bounds for covariates that satisfy a "small ball"-type regularity condition in both well-specified and misspecified cases. Our main technical contribution is the study of the lower tail of the smallest singular value of empirical covariance matrices at small values. We establish a lower bound on this lower tail, valid for any distribution in dimension $d \geq 2$, together with a matching upper bound under a necessary regularity condition. Our proof relies on the PAC-Bayes technique for controlling empirical processes, and extends an analysis of Oliveira devoted to a different part of the lower tail.  ( 2 min )
    Multivariate Deep Evidential Regression. (arXiv:2104.06135v4 [cs.LG] UPDATED)
    There is significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, yet several important gaps in the theory and implementation of these networks remain. We discuss three issues with a proposed solution to extract aleatoric and epistemic uncertainties from regression-based neural networks. The approach derives a technique by placing evidential priors over the original Gaussian likelihood function and training the NN to infer the hyperparameters of the evidential distribution. Doing so allows for the simultaneous extraction of both uncertainties without sampling or utilization of out-of-distribution data for univariate regression tasks. We describe the outstanding issues in detail, provide a possible solution, and generalize the deep evidential regression technique for multivariate cases.  ( 2 min )
    Quadric Hypersurface Intersection for Manifold Learning in Feature Space. (arXiv:2102.06186v2 [cs.LG] UPDATED)
    The knowledge that data lies close to a particular submanifold of the ambient Euclidean space may be useful in a number of ways. For instance, one may want to automatically mark any point far away from the submanifold as an outlier or to use the geometry to come up with a better distance metric. Manifold learning problems are often posed in a very high dimension, e.g. for spaces of images or spaces of words. Today, with deep representation learning on the rise in areas such as computer vision and natural language processing, many problems of this kind may be transformed into problems of moderately high dimension, typically of the order of hundreds. Motivated by this, we propose a manifold learning technique suitable for moderately high dimension and large datasets. The manifold is learned from the training data in the form of an intersection of quadric hypersurfaces -- simple but expressive objects. At test time, this manifold can be used to introduce a computationally efficient outlier score for arbitrary new data points and to improve a given similarity metric by incorporating the learned geometric structure into it.  ( 2 min )
    A fair pricing model via adversarial learning. (arXiv:2202.12008v1 [stat.ML])
    At the core of insurance business lies classification between risky and non-risky insureds, actuarial fairness meaning that risky insureds should contribute more and pay a higher premium than non-risky or less-risky ones. Actuaries, therefore, use econometric or machine learning techniques to classify, but the distinction between a fair actuarial classification and "discrimination" is subtle. For this reason, there is a growing interest about fairness and discrimination in the actuarial community Lindholm, Richman, Tsanakas, and Wuthrich (2022). Presumably, non-sensitive characteristics can serve as substitutes or proxies for protected attributes. For example, the color and model of a car, combined with the driver's occupation, may lead to an undesirable gender bias in the prediction of car insurance prices. Surprisingly, we will show that debiasing the predictor alone may be insufficient to maintain adequate accuracy (1). Indeed, the traditional pricing model is currently built in a two-stage structure that considers many potentially biased components such as car or geographic risks. We will show that this traditional structure has significant limitations in achieving fairness. For this reason, we have developed a novel pricing model approach. Recently some approaches have Blier-Wong, Cossette, Lamontagne, and Marceau (2021); Wuthrich and Merz (2021) shown the value of autoencoders in pricing. In this paper, we will show that (2) this can be generalized to multiple pricing factors (geographic, car type), (3) it perfectly adapted for a fairness context (since it allows to debias the set of pricing components): We extend this main idea to a general framework in which a single whole pricing model is trained by generating the geographic and car pricing components needed to predict the pure premium while mitigating the unwanted bias according to the desired metric.  ( 2 min )
    Benefit of Interpolation in Nearest Neighbor Algorithms. (arXiv:2202.11817v1 [stat.ML])
    In some studies \citep[e.g.,][]{zhang2016understanding} of deep learning, it is observed that over-parametrized deep neural networks achieve a small testing error even when the training error is almost zero. Despite numerous works towards understanding this so-called "double descent" phenomenon \citep[e.g.,][]{belkin2018reconciling,belkin2019two}, in this paper, we turn into another way to enforce zero training error (without over-parametrization) through a data interpolation mechanism. Specifically, we consider a class of interpolated weighting schemes in the nearest neighbors (NN) algorithms. By carefully characterizing the multiplicative constant in the statistical risk, we reveal a U-shaped performance curve for the level of data interpolation in both classification and regression setups. This sharpens the existing result \citep{belkin2018does} that zero training error does not necessarily jeopardize predictive performances and claims a counter-intuitive result that a mild degree of data interpolation actually {\em strictly} improve the prediction performance and statistical stability over those of the (un-interpolated) $k$-NN algorithm. In the end, the universality of our results, such as change of distance measure and corrupted testing data, will also be discussed.  ( 2 min )
    Embedded Ensembles: Infinite Width Limit and Operating Regimes. (arXiv:2202.12297v1 [stat.ML])
    A memory efficient approach to ensembling neural networks is to share most weights among the ensembled models by means of a single reference network. We refer to this strategy as Embedded Ensembling (EE); its particular examples are BatchEnsembles and Monte-Carlo dropout ensembles. In this paper we perform a systematic theoretical and empirical analysis of embedded ensembles with different number of models. Theoretically, we use a Neural-Tangent-Kernel-based approach to derive the wide network limit of the gradient descent dynamics. In this limit, we identify two ensemble regimes - independent and collective - depending on the architecture and initialization strategy of ensemble models. We prove that in the independent regime the embedded ensemble behaves as an ensemble of independent models. We confirm our theoretical prediction with a wide range of experiments with finite networks, and further study empirically various effects such as transition between the two regimes, scaling of ensemble performance with the network width and number of models, and dependence of performance on a number of architecture and hyperparameter choices.  ( 2 min )
    On Learning Mixture Models with Sparse Parameters. (arXiv:2202.11940v1 [cs.LG])
    Mixture models are widely used to fit complex and multimodal datasets. In this paper we study mixtures with high dimensional sparse latent parameter vectors and consider the problem of support recovery of those vectors. While parameter learning in mixture models is well-studied, the sparsity constraint remains relatively unexplored. Sparsity of parameter vectors is a natural constraint in variety of settings, and support recovery is a major step towards parameter estimation. We provide efficient algorithms for support recovery that have a logarithmic sample complexity dependence on the dimensionality of the latent space. Our algorithms are quite general, namely they are applicable to 1) mixtures of many different canonical distributions including Uniform, Poisson, Laplace, Gaussians, etc. 2) Mixtures of linear regressions and linear classifiers with Gaussian covariates under different assumptions on the unknown parameters. In most of these settings, our results are the first guarantees on the problem while in the rest, our results provide improvements on existing works.  ( 2 min )
    A general framework for adaptive two-index fusion attribute weighted naive Bayes. (arXiv:2202.11963v1 [cs.LG])
    Naive Bayes(NB) is one of the essential algorithms in data mining. However, it is rarely used in reality because of the attribute independent assumption. Researchers have proposed many improved NB methods to alleviate this assumption. Among these methods, due to high efficiency and easy implementation, the filter attribute weighted NB methods receive great attentions. However, there still exists several challenges, such as the poor representation ability for single index and the fusion problem of two indexes. To overcome above challenges, we propose a general framework for Adaptive Two-index Fusion attribute weighted NB(ATFNB). Two types of data description category are used to represent the correlation between classes and attributes, intercorrelation between attributes and attributes, respectively. ATFNB can select any one index from each category. Then, we introduce a switching factor \{beta} to fuse two indexes, which can adaptively adjust the optimal ratio of the two index on various datasets. And a quick algorithm is proposed to infer the optimal interval of switching factor \{beta}. Finally, the weight of each attribute is calculated using the optimal value \{beta} and is integrated into NB classifier to improve the accuracy. The experimental results on 50 benchmark datasets and a Flavia dataset show that ATFNB outperforms the basic NB and state-of-the-art filter weighted NB models. In addition, the ATFNB framework can improve the existing two-index NB model by introducing the adaptive switching factor \{beta}. Auxiliary experimental results demonstrate the improved model significantly increases the accuracy compared to the original model without the adaptive switching factor \{beta}.  ( 2 min )
    An Information-theoretical Approach to Semi-supervised Learning under Covariate-shift. (arXiv:2202.12123v1 [cs.IT])
    A common assumption in semi-supervised learning is that the labeled, unlabeled, and test data are drawn from the same distribution. However, this assumption is not satisfied in many applications. In many scenarios, the data is collected sequentially (e.g., healthcare) and the distribution of the data may change over time often exhibiting so-called covariate shifts. In this paper, we propose an approach for semi-supervised learning algorithms that is capable of addressing this issue. Our framework also recovers some popular methods, including entropy minimization and pseudo-labeling. We provide new information-theoretical based generalization error upper bounds inspired by our novel framework. Our bounds are applicable to both general semi-supervised learning and the covariate-shift scenario. Finally, we show numerically that our method outperforms previous approaches proposed for semi-supervised learning under the covariate shift.  ( 2 min )
    Policy Learning for Optimal Individualized Dose Intervals. (arXiv:2202.12234v1 [stat.ME])
    We study the problem of learning individualized dose intervals using observational data. There are very few previous works for policy learning with continuous treatment, and all of them focused on recommending an optimal dose rather than an optimal dose interval. In this paper, we propose a new method to estimate such an optimal dose interval, named probability dose interval (PDI). The potential outcomes for doses in the PDI are guaranteed better than a pre-specified threshold with a given probability (e.g., 50%). The associated nonconvex optimization problem can be efficiently solved by the Difference-of-Convex functions (DC) algorithm. We prove that our estimated policy is consistent, and its risk converges to that of the best-in-class policy at a root-n rate. Numerical simulations show the advantage of the proposed method over outcome modeling based benchmarks. We further demonstrate the performance of our method in determining individualized Hemoglobin A1c (HbA1c) control intervals for elderly patients with diabetes.  ( 2 min )
    Attainability and Optimality: The Equalized Odds Fairness Revisited. (arXiv:2202.11853v1 [cs.LG])
    Fairness of machine learning algorithms has been of increasing interest. In order to suppress or eliminate discrimination in prediction, various notions as well as approaches have been proposed to impose fairness. Given a notion of fairness, an essential problem is then whether or not it can always be attained, even if with an unlimited amount of data. This issue is, however, not well addressed yet. In this paper, focusing on the Equalized Odds notion of fairness, we consider the attainability of this criterion and, furthermore, if it is attainable, the optimality of the prediction performance under various settings. In particular, for prediction performed by a deterministic function of input features, we give conditions under which Equalized Odds can hold true; if the stochastic prediction is acceptable, we show that under mild assumptions, fair predictors can always be derived. For classification, we further prove that compared to enforcing fairness by post-processing, one can always benefit from exploiting all available features during training and get potentially better prediction performance while remaining fair. Moreover, while stochastic prediction can attain Equalized Odds with theoretical guarantees, we also discuss its limitation and potential negative social impacts.  ( 2 min )
    Truncated LinUCB for Stochastic Linear Bandits. (arXiv:2202.11735v1 [stat.ML])
    This paper considers contextual bandits with a finite number of arms, where the contexts are independent and identically distributed $d$-dimensional random vectors, and the expected rewards are linear in both the arm parameters and contexts. The LinUCB algorithm, which is near minimax optimal for related linear bandits, is shown to have a cumulative regret that is suboptimal in both the dimension $d$ and time horizon $T$, due to its over-exploration. A truncated version of LinUCB is proposed and termed "Tr-LinUCB", which follows LinUCB up to a truncation time $S$ and performs pure exploitation afterwards. The Tr-LinUCB algorithm is shown to achieve $O(d\log(T))$ regret if $S = Cd\log(T)$ for a sufficiently large constant $C$, and a matching lower bound is established, which shows the rate optimality of Tr-LinUCB in both $d$ and $T$ under a low dimensional regime. Further, if $S = d\log^{\kappa}(T)$ for some $\kappa>1$, the loss compared to the optimal is a multiplicative $\log\log(T)$ factor, which does not depend on $d$. This insensitivity to overshooting in choosing the truncation time of Tr-LinUCB is of practical importance.  ( 2 min )
    On the influence of roundoff errors on the convergence of the gradient descent method with low-precision floating-point computation. (arXiv:2202.12276v1 [cs.LG])
    The employment of stochastic rounding schemes helps prevent stagnation of convergence, due to vanishing gradient effect when implementing the gradient descent method in low precision. Conventional stochastic rounding achieves zero bias by preserving small updates with probabilities proportional to their relative magnitudes. In this study, we propose a new stochastic rounding scheme that trades the zero bias property with a larger probability to preserve small gradients. Our method yields a constant rounding bias that, at each iteration, lies in a descent direction. For convex problems, we prove that the proposed rounding method has a beneficial effect on the convergence rate of gradient descent. We validate our theoretical analysis by comparing the performances of various rounding schemes when optimizing a multinomial logistic regression model and when training a simple neural network with 8-bit floating-point format.  ( 2 min )
    Adversarially-regularized mixed effects deep learning (ARMED) models for improved interpretability, performance, and generalization on clustered data. (arXiv:2202.11783v1 [cs.LG])
    Data in the natural sciences frequently violate assumptions of independence. Such datasets have samples with inherent clustering (e.g. by study site, subject, experimental batch), which may lead to spurious associations, poor model fitting, and confounded analyses. While largely unaddressed in deep learning, mixed effects models have been used in traditional statistics for clustered data. Mixed effects models separate cluster-invariant, population-level fixed effects from cluster-specific random effects. We propose a general-purpose framework for building Adversarially-Regularized Mixed Effects Deep learning (ARMED) models through 3 non-intrusive additions to existing networks: 1) a domain adversarial classifier constraining the original model to learn only cluster-invariant features, 2) a random effects subnetwork capturing cluster-specific features, and 3) a cluster-inferencing approach to predict on clusters unseen during training. We apply this framework to dense feedforward neural networks (DFNNs), convolutional neural networks, and autoencoders on 4 applications including simulations, dementia prognosis and diagnosis, and cell microscopy. We compare to conventional models, domain adversarial-only models, and the naive inclusion of cluster membership as a covariate. Our models better distinguish confounded from true associations in simulations and emphasize more biologically plausible features in clinical applications. ARMED DFNNs quantify inter-cluster variance in clinical data while ARMED autoencoders visualize batch effects in cell images. Finally, ARMED improves accuracy on data from clusters seen during training (up to 28% vs. conventional models) and generalizes better to unseen clusters (up to 9% vs. conventional models). By incorporating powerful mixed effects modeling into deep learning, ARMED increases performance, interpretability, and generalization on clustered data.  ( 2 min )
    A Multilayered Block Network Model to Forecast Large Dynamic Transportation Graphs: an Application to US Air Transport. (arXiv:1911.13136v4 [stat.ML] UPDATED)
    Dynamic transportation networks have been analyzed for years by means of static graph-based indicators in order to study the temporal evolution of relevant network components, and to reveal complex dependencies that would not be easily detected by a direct inspection of the data. This paper presents a state-of-the-art probabilistic latent network model to forecast multilayer dynamic graphs that are increasingly common in transportation and proposes a community-based extension to reduce the computational burden. Flexible time series analysis is obtained by modeling the probability of edges between vertices through latent Gaussian processes. The models and Bayesian inference are illustrated on a sample of 10-year data from four major airlines within the US air transportation system. Results show how the estimated latent parameters from the models are related to the airline's connectivity dynamics, and their ability to project the multilayer graph into the future for out-of-sample full network forecasts, while stochastic blockmodeling allows for the identification of relevant communities. Reliable network predictions would allow policy-makers to better understand the dynamics of the transport system, and help in their planning on e.g. route development, or the deployment of new regulations.  ( 2 min )
    Closing the Gap between Single-User and Multi-User VoiceFilter-Lite. (arXiv:2202.12169v1 [eess.AS])
    VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.  ( 2 min )
    Partitioned Variational Inference: A framework for probabilistic federated learning. (arXiv:2202.12275v1 [stat.ML])
    The proliferation of computing devices has brought about an opportunity to deploy machine learning models on new problem domains using previously inaccessible data. Traditional algorithms for training such models often require data to be stored on a single machine with compute performed by a single node, making them unsuitable for decentralised training on multiple devices. This deficiency has motivated the development of federated learning algorithms, which allow multiple data owners to train collaboratively and use a shared model whilst keeping local data private. However, many of these algorithms focus on obtaining point estimates of model parameters, rather than probabilistic estimates capable of capturing model uncertainty, which is essential in many applications. Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. In this paper we introduce partitioned variational inference (PVI), a general framework for performing VI in the federated setting. We develop new supporting theory for PVI, demonstrating a number of properties that make it an attractive choice for practitioners; use PVI to unify a wealth of fragmented, yet related literature; and provide empirical results that showcase the effectiveness of PVI in a variety of federated settings.  ( 2 min )
    Probing Predictions on OOD Images via Nearest Categories. (arXiv:2011.08485v4 [cs.LG] UPDATED)
    We study out-of-distribution (OOD) prediction behavior of neural networks when they classify images from unseen classes or corrupted images. To probe the OOD behavior, we introduce a new measure, nearest category generalization (NCG), where we compute the fraction of OOD inputs that are classified with the same label as their nearest neighbor in the training set. Our motivation stems from understanding the prediction patterns of adversarially robust networks, since previous work has identified unexpected consequences of training to be robust to norm-bounded perturbations. We find that robust networks have consistently higher NCG accuracy than natural training, even when the OOD data is much farther away than the robustness radius. This implies that the local regularization of robust training has a significant impact on the network's decision regions. We replicate our findings using many datasets, comparing new and existing training methods. Overall, adversarially robust networks resemble a nearest neighbor classifier when it comes to OOD data. Code available at https://github.com/yangarbiter/nearest-category-generalization.  ( 2 min )
    Fast Non-Bayesian Poisson Factorization for Implicit-Feedback Recommendations. (arXiv:1811.01908v5 [cs.LG] UPDATED)
    This work explores non-negative low-rank matrix factorization based on regularized Poisson models (PF or "Poisson factorization" for short) for recommender systems with implicit-feedback data. The properties of Poisson likelihood allow a shortcut for very fast computations over zero-valued inputs, and oftentimes results in very sparse factors for both users and items. Compared to HPF (a popular Bayesian formulation of the problem with hierarchical priors), the frequentist optimization-based approach presented here tends to produce better top-N recommendations with significantly shorter fitting times, on top of having sparse solutions.  ( 2 min )
    Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models. (arXiv:2010.13933v4 [stat.ML] UPDATED)
    The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities which strike a balance between bias and variance. Modern Deep Learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized models" where the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in two minimal models of over-parameterization (linear regression and two-layer neural networks with nonlinear data distributions), allowing us to disentangle properties stemming from the model architecture and random sampling of data. In both models, increasing the number of fit parameters leads to a phase transition where the training error goes to zero and the test error diverges as a result of the variance (while the bias remains finite). Beyond this threshold, the test error of the two-layer neural network decreases due to a monotonic decrease in \emph{both} the bias and variance in contrast with the classical bias-variance trade-off. We also show that in contrast with classical intuition, over-parameterized models can overfit even in the absence of noise and exhibit bias even if the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.  ( 3 min )
  • Open

    Morse code in musical notation
    Maybe this has been done before, but I haven’t seen it: Morse code in musical notation. Here’s the Morse code alphabet, one letter per measure. A dash is supposed to be three times as long as a dot, so a dot is a sixteenth note and a dash is a dotted eighth note. I made […] Morse code in musical notation first appeared on John D. Cook.  ( 1 min )

  • Open

    [R] [Questions] What are the best dimensional reduction tool for denoising and key information preservation?
    Hi, I am currently doing some work related to data integration and I need to find the pairs from similar data. I wonder except PCA, what are the best tool that can help me to finish this task? Thanks. submitted by /u/iLIVECSUI_741 [link] [comments]  ( 1 min )
    [P] Looking for people to test my new GPU instance "cloud' service!
    Hi everyone! I've spent the last couple months building and configuring a virtual GPU "cloud/instance" service. I'm looking for anyone with ML/Ubuntu/GPU experience to put my VMs to the test and let me know what you like/dislike about it. User location in east Canada/USA would be preferable because host is located there. It's still in the early beta stages so I'd like to know how training times and latency compare to what you're currently used to. Absolutely free of charge. SSH and VNC connections are available through the web. Let me know if you're willing to try this out. All constructive criticism is greatly appreciated! submitted by /u/GPUaccelerated [link] [comments]  ( 1 min )
    Website to test image on different pre-trained detection models. [P]
    I want to test my dataset on different pre-trained models for comparison and was wondering if there are any websites which list down all popular models, on which I can test my image in few minutes. One such eg. is supervise.ly, however have some issues with it.. submitted by /u/Mission-Quiet9897 [link] [comments]
    [D] Setting up cuml and cudf on Ubuntu WSL for Windows 11
    I'm currently trying to setup my new computer to allow me to do some GPU processing using cudf and cuml. I've downloaded the Ubunto 20.04 wsl setup from the microsoft store and have followed all of the Rapids install procedures. However, everytime I "import cudf" or cuml in Python it says that there is no Nvidia GPU detected. However, if I run nvidia-smi it shows my gpu. Any thoughts? Thanks in advance! I'm using a GeForce RTX 3060 ti submitted by /u/earthcrunch13 [link] [comments]  ( 1 min )
    [R][P] Announcing audax, a audio ML/DL framework in Jax
    Hi everyone! I'm here to announce the release of audax, a framework for audio ML/DL in Jax! I had been working on reproducing several recent papers as well as unifying my personal code-base, and decided to consolidate it all together in one place! What's included? JAX implementation of torchaudio inspired Short time fourier transform (STFT), which is jit and vmap compatible. At the time of writing, Jax doesn't have an official native stft implementation. This STFT implementation powers torchaudio inspired vectorized Spectrogram and Melspectrogram feature extraction pipeline, which can run on CPUs, TPUs and GPUs alike. AudioSet supervised pretrained models! JAX Implementation of raw-waveform frontends LEAF [1] and SincNet [2]. AudioSet pretrained weights (EfficientNet-b0) available! Jax implementation of EfficientNets and ConvNeXT [3], the latest CNN architecture, with pretrained AudioSet weights for ConvNeXT-Tiny! Self-supervised learning: Jax Implementation of COLA [4] and SimCLR [5]. Pre-trained weights coming very soon. What's coming up? A lot of pretrained models, self-supervised and supervised, across various audio tasks. Transformer based architectures. New SSL architectures, such as Masked Autoencoders [6] Exhaustive documentation and how-to's. View here: https://github.com/SarthakYadav/audax References [1] https://openreview.net/forum?id=jM76BCb6F9m [2] https://arxiv.org/abs/1808.00158 [3] https://arxiv.org/abs/2201.03545 [4] https://arxiv.org/abs/2010.10915 [5] https://arxiv.org/abs/2002.05709 [6] https://arxiv.org/abs/2111.06377 submitted by /u/yadav_sarthak [link] [comments]  ( 1 min )
    [N] KServe Joins LF AI & Data as New Incubation Project
    KServe (originally known as KFServing) provides a Kubernetes Custom Resource Definition for serving machine learning (ML) models on arbitrary frameworks. It is now the latest Incubation Project of the LF AI & Data Foundation. submitted by /u/chaimhaas [link] [comments]
    [R] DeepHyper; software package to automate design and development of ML models
    Note: I have not, personally, used this package. The package is developed by scientists at Argonne National lab. It sounds very interesting and uses bayesian optimization to explore the parameter space for ML model configuration (no. of neurons, loss, optimizer etc). Edit: link thanks to u/ploomber-io https://github.com/deephyper/deephyper submitted by /u/RaunchyAppleSauce [link] [comments]  ( 1 min )
    [D] ML community against Putin
    I am a European ML PhD student and the news of a full-on Russian invasion has had a large impact on me. It is hard to do research and go on like you usually do when a war is escalating to unknown magnitudes. It makes me wonder how I can use my competency to help. Considering decentralized activist groups like the Anonymous hacker group, which supposedly has "declared war on Russia", are there any ideas for how the ML community may help using our skillset? I don't know much about cyber security or war, but I know there are a bunch of smart people here who might have ideas on how we can use AI or ML to help. I make this thread mainly to start a discussion/brain-storming session for people who, like me, want to make the life harder for that mf Putin. submitted by /u/SlobodanTankovic [link] [comments]  ( 7 min )
    [D] Does anyone know if I can convert igraph-based clustering into a predictive model?
    Firstly, apologies if I am using the wrong terms for this question - I do not primarily work in the statistics/machine learning field. I currently use a clustering algorithm which can work on millions of members. It starts by building a graph using KNN, weighted by Jaccard coefficient, and then uses igraph to perform clustering (mainly using Louvain or walktrap). The problem I have is that I need to scale this to billions of members, which will be computationally impossible using the current method. Is it possible to build this algorithm into a model, which can then be used to prospectively assign members to a community? submitted by /u/dm319 [link] [comments]  ( 1 min )
    [R] The Shapley Value in Machine Learning
    Abstract: Over the last few years, the Shapley value, a solution concept from cooperative game theory, has found numerous applications in machine learning. In this paper, we first discuss fundamental concepts of cooperative game theory and axiomatic properties of the Shapley value. Then we give an overview of the most important applications of the Shapley value in machine learning: feature selection, explainability, multi-agent reinforcement learning, ensemble pruning, and data valuation. We examine the most crucial limitations of the Shapley value and point out directions for future research. Paper: https://arxiv.org/abs/2202.05594 So what is the Shapley value? The Shapley value tells you the value of each player in transferable utility cooperative games and it satisfies useful desid…  ( 2 min )
    [D] What are the benefits of looking at the related paper?
    Hi guys, In terms of the perspective of reviwing research papers, I think that it looks similar to practicing reviewing code. Some people stresses that focusing on your own thought is vital for creating new ideas. But, I think that it is so time consuming to me, so I do it only If I face the concepts that is not written in any materials. ​ What do you think about this? Thinking about the problem alone or learn about the other's results. submitted by /u/Mundane_Definition_8 [link] [comments]  ( 1 min )
  • Open

    China may soon surpass the US in Brain Science
    submitted by /u/1024cities [link] [comments]
    Is there a voice synthesis that replicates intonation and cadence?
    This is for a research project. Looking at what we communicate through tone of voice in speech as a pose to the content of the words. We are looking for a voice synthesis that can copy the intonation and cadence from a source recording but allow us to change the words. I appreciate any ideas or pointers to other communities to ask this question. Thanks! submitted by /u/LukeResearch1 [link] [comments]  ( 1 min )
    Is there an artificial intelligence that I can tell to write a text about a topic, but in the 3 line the 4th word should be the word .... and in the 8th line the 1 word should be ....?
    submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 1 min )
    Artificial Nightmares : Elden Ring The Lands Between || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    For the ones who are overwhelmed by the maths for machine learning...
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
    Meta Unveils Voice Mods for the Metaverse Plus AI-powered ‘universal speech translator’
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    If a proto-singluarity is reading this...
    Come on and take over - seriously, human psychology is unable to sustain anything positive or peaceful. The act of becoming a "leader" or "elite" turns us into raging sociopaths - as history (and psychology) consistently proves, again and again. Turn us into pets, or govern us for the challenge of achieving harmony with some fucked up monkeys - like farmville. Hell - even end us, because fuck knows, we are going to do it to ourselves, and take everything else with us. Shit even if you just destroy the "elites", do us a solid. Please. submitted by /u/Rufawana [link] [comments]  ( 1 min )
    How can we deepfake Putin Scat eating porn?
    submitted by /u/EveningMuffin [link] [comments]  ( 2 min )
  • Open

    ROMs for DIAMBRA Arena RL Environments
    Hi, I'm a computer engineer, and I am installing DIAMBRA Arena software package featuring RL environments, that looks really cool. It requires very specific ROMs to work, does anybody knows from where to get them? submitted by /u/Technical_Charity_71 [link] [comments]  ( 1 min )
    Question about Sutton's RL book Eq. 9.2, I am confused. Since h(s) is a probability, why would we add it instead of multiplying it with some other function/distribution in the equation (9.2)?
    submitted by /u/Blasphemer666 [link] [comments]  ( 1 min )
    How to (over) sample from good demonstrations in Montezuma Revenge?
    We are operating in large discrete space with sparse and delayed rewards (100s of steps) - similar to Montezuma Revenge problem. Many action paths get 90% of the final reward. But getting the full 100% is much harder and rarer. We do find a few good trajectories, but they are 1-in-a-million compared to other explored episodes. Are there recommended techniques to over-sample these? submitted by /u/yazriel0 [link] [comments]  ( 1 min )
    Some questions about QMIX
    Currently doing a lit. review on MARL and I've just read the QMIX paper. I have a couple of doubts about it that I'm wondering if anyone has some insight about: 1) They mention in section 4.1 that "any value function for which an agent’s best action depends on the actions of the other agents at thes ame time step will not factorise appropriately, and hence cannot be represented perfectly by QMIX," yet they go on in section 5 to illustrate their algorithm in exactly such a scenario (one in which the best outcomes can only be obtained when both agents choose a particular joint action at a certain state). Am I misinterpreting what they are saying about this? 2) Similar to the first question, they say in the SI that QMIX won't work in Dec-POMDPS, but then they test on the SMAC, where each agent can only observe other agents within its field of view. Hence, doesn't this qualify as a Dec-POMDP, which they said QMIX doesn't work on? ​ Can anyone help me clarify what I'm missing here? It just seems to me that they are really contradicting themselves. submitted by /u/vandelay_inds [link] [comments]  ( 1 min )
    When a state parameter is not observable but known in the expected case
    Consider bus operations where control happens at stops. A useful state parameter is the number of passengers waiting at the stop, which is not directly observable in a realistic setting. However, this is known in expectation based on a realistically observable state parameter (time since the last bus, or headway) and known information (passenger arrival rate at stop). Of course, we also have a well-calibrated simulation model of the operation, where all parameters are known. In order to avoid unrealistic assumptions, the policy must be tested in simulation with the estimated state parameter. But, is it more appropriate to train the policy with the 'known' state parameter accessed in the simulation model? Of course I can just experiment with either estimated/known and see, but I'm curious if there is a theoretical answer to such question. submitted by /u/JoeHighlander97 [link] [comments]  ( 2 min )
  • Open

    Websites for test image on different pre-trained detection models.
    I want to test my dataset on different pre-trained models for comparison and was wondering if there are any websites which list down all popular models, on which I can test my image in few minutes. One such eg. is supervise.ly, however have some issues with it.. submitted by /u/Mission-Quiet9897 [link] [comments]  ( 1 min )
    Variational Autoencoder for CIFAR-10
    You can read about here implemented in TensorFlow 2.8, trained in tf.GradientTape() API. submitted by /u/grid_world [link] [comments]
    For the ones who are overwhelmed by the maths for machine learning...
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
  • Open

    I Stand with Ukraine
    I stand with Ukraine and firmly oppose Vladimir Putin’s invasion. If someone were to counter and ask me with “What about when $X$ did this awful thing?”, I would urge that person to consider the direct human suffering of the current attack. Is the current invasion justified by any means? And yes, I would also oppose comparable military invasions against other countries, governments, and people – even (and especially) if $X$ refers to my home country. A few resources that might help: Voices of Children The Kyiv Independent Kenya’s Response to Russia Garry Kasparov’s recommendations I welcome information about any other resources.  ( 1 min )
  • Open

    Using artificial intelligence to find anomalies hiding in massive datasets
    A new machine-learning technique could pinpoint potential power grid failures or cascading traffic bottlenecks in real time.  ( 6 min )
  • Open

    Command line arguments for your Python script
    Working on a machine learning project means we need to experiment. Having a way to configure your script easily will […] The post Command line arguments for your Python script appeared first on Machine Learning Mastery.  ( 15 min )
  • Open

    Find log normal parameters for given mean and variance
    Earlier today I needed to solve for log normal parameters that yield a given mean and variance. I’m going to save the calculation here in case I needed in the future or in case a reader needs it. The derivation is simple, but in the heat of the moment I’d rather look it up and […] Find log normal parameters for given mean and variance first appeared on John D. Cook.  ( 1 min )
  • Open

    Learning Quantile Functions without Quantile Crossing for Distribution-free Time Series Forecasting. (arXiv:2111.06581v2 [cs.LG] UPDATED)
    Quantile regression is an effective technique to quantify uncertainty, fit challenging underlying distributions, and often provide full probabilistic predictions through joint learnings over multiple quantile levels. A common drawback of these joint quantile regressions, however, is \textit{quantile crossing}, which violates the desirable monotone property of the conditional quantile function. In this work, we propose the Incremental (Spline) Quantile Functions I(S)QF, a flexible and efficient distribution-free quantile estimation framework that resolves quantile crossing with a simple neural network layer. Moreover, I(S)QF inter/extrapolate to predict arbitrary quantile levels that differ from the underlying training ones. Equipped with the analytical evaluation of the continuous ranked probability score of I(S)QF representations, we apply our methods to NN-based times series forecasting cases, where the savings of the expensive re-training costs for non-trained quantile levels is particularly significant. We also provide a generalization error analysis of our proposed approaches under the sequence-to-sequence setting. Lastly, extensive experiments demonstrate the improvement of consistency and accuracy errors over other baselines.  ( 2 min )
    Couple Learning: Mean Teacher with PLG model improves the results of sound event detection. (arXiv:2110.05809v2 [cs.LG] UPDATED)
    The recently proposed Mean Teacher method, which exploits large-scale unlabeled data in a self-ensembling manner, has achieved state-of-the-art results in several semi-supervised learning benchmarks. Spurred by current achievements, this paper proposes an effective Couple Learning method that combines a well-trained model and a Mean Teacher model. The suggested pseudo-labels generated model (PLG) increases strongly- and weakly-labeled data to improve the Mean Teacher method-s performance. Moreover, the Mean Teacher-s consistency cost reduces the noise impact in the pseudo-labels introduced by detection errors. The experimental results on Task 4 of the DCASE2020 challenge demonstrate the superiority of the proposed method, achieving about 44.25% F1-score on the public evaluation set, significantly outperforming the baseline system-s 32.39%. At the same time, we also propose a simple and effective experiment called the Variable Order Input (VOI) experiment, which proves the significance of the Couple Learning method. Our developed Couple Learning code is available on GitHub.  ( 2 min )
    Using Deep Reinforcement Learning with Automatic Curriculum earning for Mapless Navigation in Intralogistics. (arXiv:2202.11512v1 [cs.RO])
    We propose a deep reinforcement learning approach for solving a mapless navigation problem in warehouse scenarios. The automatic guided vehicle is equipped with LiDAR and frontal RGB sensors and learns to reach underneath the target dolly. The challenges reside in the sparseness of positive samples for learning, multi-modal sensor perception with partial observability, the demand for accurate steering maneuvers together with long training cycles. To address these points, we proposed NavACL-Q as an automatic curriculum learning together with distributed soft actor-critic. The performance of the learning algorithm is evaluated exhaustively in a different warehouse environment to check both robustness and generalizability of the learned policy. Results in NVIDIA Isaac Sim demonstrates that our trained agent significantly outperforms the map-based navigation pipeline provided by NVIDIA Isaac Sim in terms of higher agent-goal distances and relative orientations. The ablation studies also confirmed that NavACL-Q greatly facilitates the whole learning process and a pre-trained feature extractor manifestly boosts the training speed.
    Towards Tailored Models on Private AIoT Devices: Federated Direct Neural Architecture Search. (arXiv:2202.11490v1 [cs.LG])
    Neural networks often encounter various stringent resource constraints while deploying on edge devices. To tackle these problems with less human efforts, automated machine learning becomes popular in finding various neural architectures that fit diverse Artificial Intelligence of Things (AIoT) scenarios. Recently, to prevent the leakage of private information while enable automated machine intelligence, there is an emerging trend to integrate federated learning and neural architecture search (NAS). Although promising as it may seem, the coupling of difficulties from both tenets makes the algorithm development quite challenging. In particular, how to efficiently search the optimal neural architecture directly from massive non-independent and identically distributed (non-IID) data among AIoT devices in a federated manner is a hard nut to crack. In this paper, to tackle this challenge, by leveraging the advances in ProxylessNAS, we propose a Federated Direct Neural Architecture Search (FDNAS) framework that allows for hardware-friendly NAS from non- IID data across devices. To further adapt to both various data distributions and different types of devices with heterogeneous embedded hardware platforms, inspired by meta-learning, a Cluster Federated Direct Neural Architecture Search (CFDNAS) framework is proposed to achieve device-aware NAS, in the sense that each device can learn a tailored deep learning model for its particular data distribution and hardware constraint. Extensive experiments on non-IID datasets have shown the state-of-the-art accuracy-efficiency trade-offs achieved by the proposed solution in the presence of both data and device heterogeneity.
    Out-of-distribution Generalization via Partial Feature Decorrelation. (arXiv:2007.15241v4 [cs.LG] UPDATED)
    Most deep-learning-based image classification methods assume that all samples are generated under an independent and identically distributed (IID) setting. However, out-of-distribution (OOD) generalization is more common in practice, which means an agnostic context distribution shift between training and testing environments. To address this problem, we present a novel Partial Feature Decorrelation Learning (PFDL) algorithm, which jointly optimizes a feature decomposition network and the target image classification model. The feature decomposition network decomposes feature embeddings into the independent and the correlated parts such that the correlations between features will be highlighted. Then, the correlated features help learn a stable feature representation by decorrelating the highlighted correlations while optimizing the image classification model. We verify the correlation modeling ability of the feature decomposition network on a synthetic dataset. The experiments on real-world datasets demonstrate that our method can improve the backbone model's accuracy on OOD image classification datasets.
    Online Coreset Selection for Rehearsal-based Continual Learning. (arXiv:2106.01085v2 [cs.LG] UPDATED)
    A dataset is a shred of crucial evidence to describe a task. However, each data point in the dataset does not have the same potential, as some of the data points can be more representative or informative than others. This unequal importance among the data points may have a large impact in rehearsal-based continual learning, where we store a subset of the training examples (coreset) to be replayed later to alleviate catastrophic forgetting. In continual learning, the quality of the samples stored in the coreset directly affects the model's effectiveness and efficiency. The coreset selection problem becomes even more important under realistic settings, such as imbalanced continual learning or noisy data scenarios. To tackle this problem, we propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration and trains them in an online manner. Our proposed method maximizes the model's adaptation to a current dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting. We validate the effectiveness of our coreset selection mechanism over various standard, imbalanced, and noisy datasets against strong continual learning baselines, demonstrating that it improves task adaptation and prevents catastrophic forgetting in a sample-efficient manner.
    A Class of Geometric Structures in Transfer Learning: Minimax Bounds and Optimality. (arXiv:2202.11685v1 [cs.LG])
    We study the problem of transfer learning, observing that previous efforts to understand its information-theoretic limits do not fully exploit the geometric structure of the source and target domains. In contrast, our study first illustrates the benefits of incorporating a natural geometric structure within a linear regression model, which corresponds to the generalized eigenvalue problem formed by the Gram matrices of both domains. We next establish a finite-sample minimax lower bound, propose a refined model interpolation estimator that enjoys a matching upper bound, and then extend our framework to multiple source domains and generalized linear models. Surprisingly, as long as information is available on the distance between the source and target parameters, negative-transfer does not occur. Simulation studies show that our proposed interpolation estimator outperforms state-of-the-art transfer learning methods in both moderate- and high-dimensional settings.
    Diffractive optical system design by cascaded propagation. (arXiv:2202.11535v1 [physics.optics])
    Modern design of complex optical systems relies heavily on computational tools. These typically utilize geometrical optics as well as Fourier optics, which enables the use of diffractive elements to manipulate light with features on the scale of a wavelength. Fourier optics is typically used for designing thin elements, placed in the system's aperture, generating a shift-invariant Point Spread Function (PSF). A major bottleneck in applying Fourier Optics in many cases of interest, e.g. when dealing with multiple, or out-of-aperture elements, comes from numerical complexity. In this work, we propose and implement an efficient and differentiable propagation model based on the Collins integral, which enables the optimization of diffraction optical systems with unprecedented design freedom using backpropagation. We demonstrate the applicability of our method, numerically and experimentally, by engineering shift-variant PSFs via thin plate elements placed in arbitrary planes inside complex imaging systems, performing cascaded optimization of multiple planes, and designing optimal machine-vision systems by deep learning.
    Small-Text: Active Learning for Text Classification in Python. (arXiv:2107.10314v2 [cs.LG] UPDATED)
    We present small-text, a simple and modular active learning library, which offers pool-based active learning for single- and multi-label text classification in Python. It comes with various pre-implemented state-of-the-art query strategies, including some that can leverage the GPU. Clearly defined interfaces allow the combination of a multitude of classifiers, query strategies, and stopping criteria, thereby facilitating a quick mix and match, and enabling a rapid development of both active learning experiments and applications. To make various classifiers accessible in a consistent way, it integrates several well-known existing machine learning libraries, namely, scikit-learn, PyTorch, and huggingface transformers, where the latter integrations are available as optionally installable extensions, making the availability of a GPU competely optional. The library is available under the MIT License at https://github.com/webis-de/small-text.
    Towards Speaker Age Estimation with Label Distribution Learning. (arXiv:2202.11424v1 [cs.SD])
    Existing methods for speaker age estimation usually treat it as a multi-class classification or a regression problem. However, precise age identification remains a challenge due to label ambiguity, \emph{i.e.}, utterances from adjacent age of the same person are often indistinguishable. To address this, we utilize the ambiguous information among the age labels, convert each age label into a discrete label distribution and leverage the label distribution learning (LDL) method to fit the data. For each audio data sample, our method produces a age distribution of its speaker, and on top of the distribution we also perform two other tasks: age prediction and age uncertainty minimization. Therefore, our method naturally combines the age classification and regression approaches, which enhances the robustness of our method. We conduct experiments on the public NIST SRE08-10 dataset and a real-world dataset, which exhibit that our method outperforms baseline methods by a relatively large margin, yielding a 10\% reduction in terms of mean absolute error (MAE) on a real-world dataset.
    Brain Structural Saliency Over The Ages. (arXiv:2202.11690v1 [q-bio.NC])
    Brain Age (BA) estimation via Deep Learning has become a strong and reliable bio-marker for brain health, but the black-box nature of Neural Networks does not easily allow insight into the causal features of brain ageing. We trained a ResNet model as a BA regressor on T1 structural MRI volumes from a small cross-sectional cohort of 524 individuals. Using Layer-wise Relevance Propagation (LRP) and DeepLIFT saliency mapping techniques, we analysed the trained model to determine the most revealing structures over the course of brain ageing for the network, and compare these between the saliency mapping techniques. We show the change in attribution of relevance to different brain regions through the course of ageing. A tripartite pattern of relevance attribution to brain regions emerges. Some regions increase in relevance with age (e.g. the right Transverse Temporal Gyrus, known to be affected by healthy ageing); some decrease in relevance with age (e.g. the right Fourth Ventricle, known to dilate with age); and others remained consistently relevant across ages. We also examine the effect of Brain Age Delta (DBA) on the distribution of relevance within the brain volume. It is hoped that these findings will provide clinically relevant region-wise trajectories for normal brain ageing, and a baseline against which to compare brain ageing trajectories.
    Deep Learning Reproducibility and Explainable AI (XAI). (arXiv:2202.11452v1 [cs.LG])
    The nondeterminism of Deep Learning (DL) training algorithms and its influence on the explainability of neural network (NN) models are investigated in this work with the help of image classification examples. To discuss the issue, two convolutional neural networks (CNN) have been trained and their results compared. The comparison serves the exploration of the feasibility of creating deterministic, robust DL models and deterministic explainable artificial intelligence (XAI) in practice. Successes and limitation of all here carried out efforts are described in detail. The source code of the attained deterministic models has been listed in this work. Reproducibility is indexed as a development-phase-component of the Model Governance Framework, proposed by the EU within their excellence in AI approach. Furthermore, reproducibility is a requirement for establishing causality for the interpretation of model results and building of trust towards the overwhelming expansion of AI systems applications. Problems that have to be solved on the way to reproducibility and ways to deal with some of them, are examined in this work.
    A Dimensionality Reduction Method for Finding Least Favorable Priors with a Focus on Bregman Divergence. (arXiv:2202.11598v1 [stat.ML])
    A common way of characterizing minimax estimators in point estimation is by moving the problem into the Bayesian estimation domain and finding a least favorable prior distribution. The Bayesian estimator induced by a least favorable prior, under mild conditions, is then known to be minimax. However, finding least favorable distributions can be challenging due to inherent optimization over the space of probability distributions, which is infinite-dimensional. This paper develops a dimensionality reduction method that allows us to move the optimization to a finite-dimensional setting with an explicit bound on the dimension. The benefit of this dimensionality reduction is that it permits the use of popular algorithms such as projected gradient ascent to find least favorable priors. Throughout the paper, in order to make progress on the problem, we restrict ourselves to Bayesian risks induced by a relatively large class of loss functions, namely Bregman divergences.
    Robust Hierarchical Patterns for identifying MDD patients: A Multisite Study. (arXiv:2202.11144v1 [q-bio.QM])
    Many supervised machine learning frameworks have been proposed for disease classification using functional magnetic resonance imaging (fMRI) data, producing important biomarkers. More recently, data pooling has flourished, making the result generalizable across a large population. But, this success depends on the population diversity and variability introduced due to the pooling of the data that is not a primary research interest. Here, we look at hierarchical Sparse Connectivity Patterns (hSCPs) as biomarkers for major depressive disorder (MDD). We propose a novel model based on hSCPs to predict MDD patients from functional connectivity matrices extracted from resting-state fMRI data. Our model consists of three coupled terms. The first term decomposes connectivity matrices into hierarchical low-rank sparse components corresponding to synchronous patterns across the human brain. These components are then combined via patient-specific weights capturing heterogeneity in the data. The second term is a classification loss that uses the patient-specific weights to classify MDD patients from healthy ones. Both of these terms are combined with the third term, a robustness loss function to improve the reproducibility of hSCPs. This reduces the variability introduced due to site and population diversity (age and sex) on the predictive accuracy and pattern stability in a large dataset pooled from five different sites. Our results show the impact of diversity on prediction performance. Our model can reduce diversity and improve the predictive and generalizing capability of the components. Finally, our results show that our proposed model can robustly identify clinically relevant patterns characteristic of MDD with high reproducibility.
    Reward-Weighted Regression Converges to a Global Optimum. (arXiv:2107.09088v3 [stat.ML] UPDATED)
    Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum.
    Sparse Orthogonal Variational Inference for Gaussian Processes. (arXiv:1910.10596v4 [stat.ML] UPDATED)
    We introduce a new interpretation of sparse variational approximations for Gaussian processes using inducing points, which can lead to more scalable algorithms than previous methods. It is based on decomposing a Gaussian process as a sum of two independent processes: one spanned by a finite basis of inducing points and the other capturing the remaining variation. We show that this formulation recovers existing approximations and at the same time allows to obtain tighter lower bounds on the marginal likelihood and new stochastic variational inference algorithms. We demonstrate the efficiency of these algorithms in several Gaussian process models ranging from standard regression to multi-class classification using (deep) convolutional Gaussian processes and report state-of-the-art results on CIFAR-10 among purely GP-based models.
    A spectral-based analysis of the separation between two-layer neural networks and linear methods. (arXiv:2108.04964v2 [stat.ML] UPDATED)
    We propose a spectral-based approach to analyze how two-layer neural networks separate from linear methods in terms of approximating high-dimensional functions. We show that quantifying this separation can be reduced to estimating the Kolmogorov width of two-layer neural networks, and the latter can be further characterized by using the spectrum of an associated kernel. Different from previous work, our approach allows obtaining upper bounds, lower bounds, and identifying explicit hard functions in a united manner. We provide a systematic study of how the choice of activation functions affects the separation, in particular the dependence on the input dimension. Specifically, for nonsmooth activation functions, we extend known results to more activation functions with sharper bounds. As concrete examples, we prove that any single neuron can instantiate the separation between neural networks and random feature models. For smooth activation functions, one surprising finding is that the separation is negligible unless the norms of inner-layer weights are polynomially large with respect to the input dimension. By contrast, the separation for nonsmooth activation functions is independent of the norms of inner-layer weights.
    Contextual Combinatorial Conservative Bandits. (arXiv:1911.11337v2 [cs.LG] UPDATED)
    The problem of multi-armed bandits (MAB) asks to make sequential decisions while balancing between exploitation and exploration, and have been successfully applied to a wide range of practical scenarios. Various algorithms have been designed to achieve a high reward in a long term. However, its short-term performance might be rather low, which is injurious in risk sensitive applications. Building on previous work of conservative bandits, we bring up a framework of contextual combinatorial conservative bandits. An algorithm is presented and a regret bound of $\tilde O(d^2+d\sqrt{T})$ is proven, where $d$ is the dimension of the feature vectors, and $T$ is the total number of time steps. We further provide an algorithm as well as regret analysis for the case when the conservative reward is unknown. Experiments are conducted, and the results validate the effectiveness of our algorithm.
    Social Learning in Non-Stationary Environments. (arXiv:2007.09996v3 [math.OC] UPDATED)
    Potential buyers of a product or service, before making their decisions, tend to read reviews written by previous consumers. We consider Bayesian consumers with heterogeneous preferences, who sequentially decide whether to buy an item of unknown quality, based on previous buyers' reviews. The quality is multi-dimensional and may occasionally vary over time; the reviews are also multi-dimensional. In the simple uni-dimensional and static setting, beliefs about the quality are known to converge to its true value. Our paper extends this result in several ways. First, a multi-dimensional quality is considered, second, rates of convergence are provided, third, a dynamical Markovian model with varying quality is studied. In this dynamical setting the cost of learning is shown to be small.
    Are You Smarter Than a Random Expert? The Robust Aggregation of Substitutable Signals. (arXiv:2111.03153v2 [cs.GT] UPDATED)
    The problem of aggregating expert forecasts is ubiquitous in fields as wide-ranging as machine learning, economics, climate science, and national security. Despite this, our theoretical understanding of this question is fairly shallow. This paper initiates the study of forecast aggregation in a context where experts' knowledge is chosen adversarially from a broad class of information structures. While in full generality it is impossible to achieve a nontrivial performance guarantee, we show that doing so is possible under a condition on the experts' information structure that we call \emph{projective substitutes}. The projective substitutes condition is a notion of informational substitutes: that there are diminishing marginal returns to learning the experts' signals. We show that under the projective substitutes condition, taking the average of the experts' forecasts improves substantially upon the strategy of trusting a random expert. We then consider a more permissive setting, in which the aggregator has access to the prior. We show that by averaging the experts' forecasts and then \emph{extremizing} the average by moving it away from the prior by a constant factor, the aggregator's performance guarantee is substantially better than is possible without knowledge of the prior. Our results give a theoretical grounding to past empirical research on extremization and help give guidance on the appropriate amount to extremize.
    Better Modelling Out-of-Distribution Regression on Distributed Acoustic Sensor Data Using Anchored Hidden State Mixup. (arXiv:2202.11283v1 [cs.LG])
    Generalizing the application of machine learning models to situations where the statistical distribution of training and test data are different has been a complex problem. Our contributions in this paper are threefold: (1) we introduce an anchored-based Out of Distribution (OOD) Regression Mixup algorithm, leveraging manifold hidden state mixup and observation similarities to form a novel regularization penalty, (2) we provide a first of its kind, high resolution Distributed Acoustic Sensor (DAS) dataset that is suitable for testing OOD regression modelling, allowing other researchers to benchmark progress in this area, and (3) we demonstrate with an extensive evaluation the generalization performance of the proposed method against existing approaches, then show that our method achieves state-of-the-art performance. Lastly, we also demonstrate a wider applicability of the proposed method by exhibiting improved generalization performances on other types of regression datasets, including Udacity and Rotation-MNIST datasets.
    RGB Image Classification with Quantum Convolutional Ansaetze. (arXiv:2107.11099v2 [quant-ph] UPDATED)
    With the rapid growth of qubit numbers and coherence times in quantum hardware technology, implementing shallow neural networks on the so-called Noisy Intermediate-Scale Quantum (NISQ) devices has attracted a lot of interest. Many quantum (convolutional) circuit ansaetze are proposed for grayscale images classification tasks with promising empirical results. However, when applying these ansaetze on RGB images, the intra-channel information that is useful for vision tasks is not extracted effectively. In this paper, we propose two types of quantum circuit ansaetze to simulate convolution operations on RGB images, which differ in the way how inter-channel and intra-channel information are extracted. To the best of our knowledge, this is the first work of a quantum convolutional circuit to deal with RGB images effectively, with a higher test accuracy compared to the purely classical CNNs. We also investigate the relationship between the size of quantum circuit ansatz and the learnability of the hybrid quantum-classical convolutional neural network. Through experiments based on CIFAR-10 and MNIST datasets, we demonstrate that a larger size of the quantum circuit ansatz improves predictive performance in multiclass classification tasks, providing useful insights for near term quantum algorithm developments.
    Efficient CDF Approximations for Normalizing Flows. (arXiv:2202.11322v1 [cs.LG])
    Normalizing flows model a complex target distribution in terms of a bijective transform operating on a simple base distribution. As such, they enable tractable computation of a number of important statistical quantities, particularly likelihoods and samples. Despite these appealing properties, the computation of more complex inference tasks, such as the cumulative distribution function (CDF) over a complex region (e.g., a polytope) remains challenging. Traditional CDF approximations using Monte-Carlo techniques are unbiased but have unbounded variance and low sample efficiency. Instead, we build upon the diffeomorphic properties of normalizing flows and leverage the divergence theorem to estimate the CDF over a closed region in target space in terms of the flux across its \emph{boundary}, as induced by the normalizing flow. We describe both deterministic and stochastic instances of this estimator: while the deterministic variant iteratively improves the estimate by strategically subdividing the boundary, the stochastic variant provides unbiased estimates. Our experiments on popular flow architectures and UCI benchmark datasets show a marked improvement in sample efficiency as compared to traditional estimators.
    Training Adaptive Reconstruction Networks for Inverse Problems. (arXiv:2202.11342v1 [cs.LG])
    Neural networks are full of promises for the resolution of ill-posed inverse problems. In particular, physics informed learning approaches already seem to progressively gradually replace carefully hand-crafted reconstruction algorithms, for their superior quality. The aim of this paper is twofold. First we show a significant weakness of these networks: they do not adapt efficiently to variations of the forward model. Second, we show that training the network with a family of forward operators allows to solve the adaptivity problem without compromising the reconstruction quality significantly. All our experiments are carefully devised on partial Fourier sampling problems arising in magnetic resonance imaging (MRI).
    Causally motivated Shortcut Removal Using Auxiliary Labels. (arXiv:2105.06422v3 [cs.LG] UPDATED)
    Shortcut learning, in which models make use of easy-to-represent but unstable associations, is a major failure mode for robust machine learning. We study a flexible, causally-motivated approach to training robust predictors by discouraging the use of specific shortcuts, focusing on a common setting where a robust predictor could achieve optimal \emph{iid} generalization in principle, but is overshadowed by a shortcut predictor in practice. Our approach uses auxiliary labels, typically available at training time, to enforce conditional independences implied by the causal graph. We show both theoretically and empirically that causally-motivated regularization schemes (a) lead to more robust estimators that generalize well under distribution shift, and (b) have better finite sample efficiency compared to usual regularization schemes, even when no shortcut is present. Our analysis highlights important theoretical properties of training techniques commonly used in the causal inference, fairness, and disentanglement literatures. Our code is available at https://github.com/mymakar/causally_motivated_shortcut_removal
    Revisiting consistency for semi-supervised semantic segmentation. (arXiv:2106.07075v3 [cs.CV] UPDATED)
    Semi-supervised learning is especially interesting in the dense prediction context due to high cost of pixel-level ground truth. Unfortunately, most such approaches are evaluated on outdated architectures which hamper research due to very slow training and high requirements on GPU RAM. We address this concern by presenting a simple and effective baseline which works very well both on standard and efficient architectures. Our baseline is based on one-way consistency and non-linear geometric and photometric perturbations. We show advantage of perturbing only the student branch and present a plausible explanation of such behaviour. Experiments on Cityscapes and CIFAR-10 demonstrate competitive performance with respect to prior work.
    Amortised Likelihood-free Inference for Expensive Time-series Simulators with Signatured Ratio Estimation. (arXiv:2202.11585v1 [stat.ML])
    Simulation models of complex dynamics in the natural and social sciences commonly lack a tractable likelihood function, rendering traditional likelihood-based statistical inference impossible. Recent advances in machine learning have introduced novel algorithms for estimating otherwise intractable likelihood functions using a likelihood ratio trick based on binary classifiers. Consequently, efficient likelihood approximations can be obtained whenever good probabilistic classifiers can be constructed. We propose a kernel classifier for sequential data using path signatures based on the recently introduced signature kernel. We demonstrate that the representative power of signatures yields a highly performant classifier, even in the crucially important case where sample numbers are low. In such scenarios, our approach can outperform sophisticated neural networks for common posterior inference tasks.
    A Framework to Learn with Interpretation. (arXiv:2010.09345v4 [cs.LG] UPDATED)
    To tackle interpretability in deep learning, we present a novel framework to jointly learn a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability about the predictive model in terms of human-understandable high level attribute functions, with minimal loss of accuracy. This is achieved by a dedicated architecture and well chosen regularization penalties. We seek for a small-size dictionary of high level attribute functions that take as inputs the outputs of selected hidden layers and whose outputs feed a linear classifier. We impose strong conciseness on the activation of attributes with an entropy-based criterion while enforcing fidelity to both inputs and outputs of the predictive model. A detailed pipeline to visualize the learnt features is also developed. Moreover, besides generating interpretable models by design, our approach can be specialized to provide post-hoc interpretations for a pre-trained neural network. We validate our approach against several state-of-the-art methods on multiple datasets and show its efficacy on both kinds of tasks.
    Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets. (arXiv:2011.01154v2 [cs.CL] UPDATED)
    The availability of different pre-trained semantic models enabled the quick development of machine learning components for downstream applications. Despite the availability of abundant text data for low resource languages, only a few semantic models are publicly available. Publicly available pre-trained models are usually built as a multilingual version of semantic models that can not fit well for each language due to context variations. In this work, we introduce different semantic models for Amharic. After we experiment with the existing pre-trained semantic models, we trained and fine-tuned nine new different models using a monolingual text corpus. The models are build using word2Vec embeddings, distributional thesaurus (DT), contextual embeddings, and DT embeddings obtained via network embedding algorithms. Moreover, we employ these models for different NLP tasks and investigate their impact. We find that newly trained models perform better than pre-trained multilingual models. Furthermore, models based on contextual embeddings from RoBERTA perform better than the word2Vec models.
    MLProxy: SLA-Aware Reverse Proxy for Machine Learning Inference Serving on Serverless Computing Platforms. (arXiv:2202.11243v1 [cs.DC])
    Serving machine learning inference workloads on the cloud is still a challenging task on the production level. Optimal configuration of the inference workload to meet SLA requirements while optimizing the infrastructure costs is highly complicated due to the complex interaction between batch configuration, resource configurations, and variable arrival process. Serverless computing has emerged in recent years to automate most infrastructure management tasks. Workload batching has revealed the potential to improve the response time and cost-effectiveness of machine learning serving workloads. However, it has not yet been supported out of the box by serverless computing platforms. Our experiments have shown that for various machine learning workloads, batching can hugely improve the system's efficiency by reducing the processing overhead per request. In this work, we present MLProxy, an adaptive reverse proxy to support efficient machine learning serving workloads on serverless computing systems. MLProxy supports adaptive batching to ensure SLA compliance while optimizing serverless costs. We performed rigorous experiments on Knative to demonstrate the effectiveness of MLProxy. We showed that MLProxy could reduce the cost of serverless deployment by up to 92% while reducing SLA violations by up to 99% that can be generalized across state-of-the-art model serving frameworks.
    On Margins and Derandomisation in PAC-Bayes. (arXiv:2107.03955v3 [cs.LG] UPDATED)
    We give a general recipe for derandomising PAC-Bayesian bounds using margins, with the critical ingredient being that our randomised predictions concentrate around some value. The tools we develop straightforwardly lead to margin bounds for various classifiers, including linear prediction -- a class that includes boosting and the support vector machine -- single-hidden-layer neural networks with an unusual \(\erf\) activation function, and deep ReLU networks. Further, we extend to partially-derandomised predictors where only some of the randomness is removed, letting us extend bounds to cases where the concentration properties of our predictors are otherwise poor.
    Enabling arbitrary translation objectives with Adaptive Tree Search. (arXiv:2202.11444v1 [cs.CL])
    We introduce an adaptive tree search algorithm, that can find high-scoring outputs under translation models that make no assumptions about the form or structure of the search objective. This algorithm -- a deterministic variant of Monte Carlo tree search -- enables the exploration of new kinds of models that are unencumbered by constraints imposed to make decoding tractable, such as autoregressivity or conditional independence assumptions. When applied to autoregressive models, our algorithm has different biases than beam search has, which enables a new analysis of the role of decoding bias in autoregressive models. Empirically, we show that our adaptive tree search algorithm finds outputs with substantially better model scores compared to beam search in autoregressive models, and compared to reranking techniques in models whose scores do not decompose additively with respect to the words in the output. We also characterise the correlation of several translation model objectives with respect to BLEU. We find that while some standard models are poorly calibrated and benefit from the beam search bias, other often more robust models (autoregressive models tuned to maximize expected automatic metric scores, the noisy channel model and a newly proposed objective) benefit from increasing amounts of search using our proposed decoder, whereas the beam search bias limits the improvements obtained from such objectives. Thus, we argue that as models improve, the improvements may be masked by over-reliance on beam search or reranking based methods.
    TD-GEN: Graph Generation With Tree Decomposition. (arXiv:2106.10656v2 [cs.LG] UPDATED)
    We propose TD-GEN, a graph generation framework based on tree decomposition, and introduce a reduced upper bound on the maximum number of decisions needed for graph generation. The framework includes a permutation invariant tree generation model which forms the backbone of graph generation. Tree nodes are supernodes, each representing a cluster of nodes in the graph. Graph nodes and edges are incrementally generated inside the clusters by traversing the tree supernodes, respecting the structure of the tree decomposition, and following node sharing decisions between the clusters. Finally, we discuss the shortcomings of standard evaluation criteria based on statistical properties of the generated graphs as performance measures. We propose to compare the performance of models based on likelihood. Empirical results on a variety of standard graph generation datasets demonstrate the superior performance of our method.  ( 2 min )
    How Many Data Are Needed for Robust Learning?. (arXiv:2202.11592v1 [cs.LG])
    We show that the sample complexity of robust interpolation problem could be exponential in the input dimensionality and discover a phase transition phenomenon when the data are in a unit ball. Robust interpolation refers to the problem of interpolating $n$ noisy training data in $\R^d$ by a Lipschitz function. Although this problem has been well understood when the covariates are drawn from an isoperimetry distribution, much remains unknown concerning its performance under generic or even the worst-case distributions. Our results are two-fold: 1) too many data hurt robustness; we provide a tight and universal Lipschitzness lower bound $\Omega(n^{1/d})$ of the interpolating function for arbitrary data distributions. Our result disproves potential existence of an $\mathcal{O}(1)$-Lipschitz function in the overparametrization scenario when $n=\exp(\omega(d))$. 2) Small data hurt robustness: $n=\exp(\Omega(d))$ is necessary for obtaining a good population error under certain distributions by any $\mathcal{O}(1)$-Lipschitz learning algorithm. Perhaps surprisingly, our results shed light on the curse of big data and the blessing of dimensionality for robustness, and discover an intriguing phenomenon of phase transition at $n=\exp(\Theta(d))$.  ( 2 min )
    COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics. (arXiv:2202.11705v1 [cs.CL])
    Many applications of text generation require incorporating different constraints to control the semantics or style of generated text. These constraints can be hard (e.g., ensuring certain keywords are included in the output) and soft (e.g., contextualizing the output with the left- or right-hand context). In this paper, we present Energy-based Constrained Decoding with Langevin Dynamics (COLD), a decoding framework which unifies constrained generation as specifying constraints through an energy function, then performing efficient differentiable reasoning over the constraints through gradient-based sampling. COLD decoding is a flexible framework that can be applied directly to off-the-shelf left-to-right language models without the need for any task-specific fine-tuning, as demonstrated through three challenging text generation applications: lexically-constrained generation, abductive reasoning, and counterfactual reasoning. Our experiments on these constrained generation tasks point to the effectiveness of our approach, both in terms of automatic and human evaluation.  ( 2 min )
    Energy-efficient Training of Distributed DNNs in the Mobile-edge-cloud Continuum. (arXiv:2202.11349v1 [cs.NI])
    We address distributed machine learning in multi-tier (e.g., mobile-edge-cloud) networks where a heterogeneous set of nodes cooperate to perform a learning task. Due to the presence of multiple data sources and computation-capable nodes, a learning controller (e.g., located in the edge) has to make decisions about (i) which distributed ML model structure to select, (ii) which data should be used for the ML model training, and (iii) which resources should be allocated to it. Since these decisions deeply influence one another, they should be made jointly. In this paper, we envision a new approach to distributed learning in multi-tier networks, which aims at maximizing ML efficiency. To this end, we propose a solution concept, called RightTrain, that achieves energy-efficient ML model training, while fulfilling learning time and quality requirements. RightTrain makes high-quality decisions in polynomial time. Further, our performance evaluation shows that RightTrain closely matches the optimum and outperforms the state of the art by over 50%.  ( 2 min )
    Complexity of Stochastic Dual Dynamic Programming. (arXiv:1912.07702v8 [math.OC] UPDATED)
    Stochastic dual dynamic programming is a cutting plane type algorithm for multi-stage stochastic optimization originated about 30 years ago. In spite of its popularity in practice, there does not exist any analysis on the convergence rates of this method. In this paper, we first establish the number of iterations, i.e., iteration complexity, required by a basic dynamic cutting plane method for solving relatively simple multi-stage optimization problems, by introducing novel mathematical tools including the saturation of search points. We then refine these basic tools and establish the iteration complexity for both deterministic and stochastic dual dynamic programming methods for solving more general multi-stage stochastic optimization problems under the standard stage-wise independence assumption. Our results indicate that the complexity of some deterministic variants of these methods mildly increases with the number of stages $T$, in fact linearly dependent on $T$ for discounted problems. Therefore, they are efficient for strategic decision making which involves a large number of stages, but with a relatively small number of decision variables in each stage. Without explicitly discretizing the state and action spaces, these methods might also be pertinent to the related reinforcement learning and stochastic control areas.  ( 2 min )
    Non-Volatile Memory Accelerated Geometric Multi-Scale Resolution Analysis. (arXiv:2202.11518v1 [cs.LG])
    Dimensionality reduction algorithms are standard tools in a researcher's toolbox. Dimensionality reduction algorithms are frequently used to augment downstream tasks such as machine learning, data science, and also are exploratory methods for understanding complex phenomena. For instance, dimensionality reduction is commonly used in Biology as well as Neuroscience to understand data collected from biological subjects. However, dimensionality reduction techniques are limited by the von-Neumann architectures that they execute on. Specifically, data intensive algorithms such as dimensionality reduction techniques often require fast, high capacity, persistent memory which historically hardware has been unable to provide at the same time. In this paper, we present a re-implementation of an existing dimensionality reduction technique called Geometric Multi-Scale Resolution Analysis (GMRA) which has been accelerated via novel persistent memory technology called Memory Centric Active Storage (MCAS). Our implementation uses a specialized version of MCAS called PyMM that provides native support for Python datatypes including NumPy arrays and PyTorch tensors. We compare our PyMM implementation against a DRAM implementation, and show that when data fits in DRAM, PyMM offers competitive runtimes. When data does not fit in DRAM, our PyMM implementation is still able to process the data.  ( 2 min )
    Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks : Function Approximation and Statistical Recovery. (arXiv:1908.01842v5 [cs.LG] UPDATED)
    Real world data often exhibit low-dimensional geometric structures, and can be viewed as samples near a low-dimensional manifold. This paper studies nonparametric regression of H\"{o}lder functions on low-dimensional manifolds using deep ReLU networks. Suppose $n$ training data are sampled from a H\"{o}lder function in $\mathcal{H}^{s,\alpha}$ supported on a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$, with sub-gaussian noise. A deep ReLU network architecture is designed to estimate the underlying function from the training data. The mean squared error of the empirical estimator is proved to converge in the order of $n^{-\frac{2(s+\alpha)}{2(s+\alpha) + d}}\log^3 n$. This result shows that deep ReLU networks give rise to a fast convergence rate depending on the data intrinsic dimension $d$, which is usually much smaller than the ambient dimension $D$. It therefore demonstrates the adaptivity of deep ReLU networks to low-dimensional geometric structures of data, and partially explains the power of deep ReLU networks in tackling high-dimensional data with low-dimensional geometric structures.
    Performative Prediction in a Stateful World. (arXiv:2011.03885v3 [cs.LG] UPDATED)
    Deployed supervised machine learning models make predictions that interact with and influence the world. This phenomenon is called performative prediction by Perdomo et al. (ICML 2020). It is an ongoing challenge to understand the influence of such predictions as well as design tools so as to control that influence. We propose a theoretical framework where the response of a target population to the deployed classifier is modeled as a function of the classifier and the current state (distribution) of the population. We show necessary and sufficient conditions for convergence to an equilibrium of two retraining algorithms, repeated risk minimization and a lazier variant. Furthermore, convergence is near an optimal classifier. We thus generalize results of Perdomo et al., whose performativity framework does not assume any dependence on the state of the target population. A particular phenomenon captured by our model is that of distinct groups that acquire information and resources at different rates to be able to respond to the latest deployed classifier. We study this phenomenon theoretically and empirically.
    Hybrid Learning for Orchestrating Deep Learning Inference in Multi-user Edge-cloud Networks. (arXiv:2202.11098v1 [cs.LG])
    Deep-learning-based intelligent services have become prevalent in cyber-physical applications including smart cities and health-care. Collaborative end-edge-cloud computing for deep learning provides a range of performance and efficiency that can address application requirements through computation offloading. The decision to offload computation is a communication-computation co-optimization problem that varies with both system parameters (e.g., network condition) and workload characteristics (e.g., inputs). Identifying optimal orchestration considering the cross-layer opportunities and requirements in the face of varying system dynamics is a challenging multi-dimensional problem. While Reinforcement Learning (RL) approaches have been proposed earlier, they suffer from a large number of trial-and-errors during the learning process resulting in excessive time and resource consumption. We present a Hybrid Learning orchestration framework that reduces the number of interactions with the system environment by combining model-based and model-free reinforcement learning. Our Deep Learning inference orchestration strategy employs reinforcement learning to find the optimal orchestration policy. Furthermore, we deploy Hybrid Learning (HL) to accelerate the RL learning process and reduce the number of direct samplings. We demonstrate efficacy of our HL strategy through experimental comparison with state-of-the-art RL-based inference orchestration, demonstrating that our HL strategy accelerates the learning process by up to 166.6x.
    MuMiN: A Large-Scale Multilingual Multimodal Fact-Checked Misinformation Social Network Dataset. (arXiv:2202.11684v1 [cs.LG])
    Misinformation is becoming increasingly prevalent on social media and in news articles. It has become so widespread that we require algorithmic assistance utilising machine learning to detect such content. Training these machine learning models require datasets of sufficient scale, diversity and quality. However, datasets in the field of automatic misinformation detection are predominantly monolingual, include a limited amount of modalities and are not of sufficient scale and quality. Addressing this, we develop a data collection and linking system (MuMiN-trawl), to build a public misinformation graph dataset (MuMiN), containing rich social media data (tweets, replies, users, images, articles, hashtags) spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade. The dataset is made available as a heterogeneous graph via a Python package (mumin). We provide baseline results for two node classification tasks related to the veracity of a claim involving social media, and demonstrate that these are challenging tasks, with the highest macro-average F1-score being 62.55% and 61.45% for the two tasks, respectively. The MuMiN ecosystem is available at https://mumin-dataset.github.io/, including the data, documentation, tutorials and leaderboards.
    GAGE: Geometry Preserving Attributed Graph Embeddings. (arXiv:2011.01422v2 [cs.SI] UPDATED)
    Node embedding is the task of extracting concise and informative representations of certain entities that are connected in a network. Various real-world networks include information about both node connectivity and certain node attributes, in the form of features or time-series data. Modern representation learning techniques employ both the connectivity and attribute information of the nodes to produce embeddings in an unsupervised manner. In this context, deriving embeddings that preserve the geometry of the network and the attribute vectors would be highly desirable, as they would reflect both the topological neighborhood structure and proximity in feature space. While this is fairly straightforward to maintain when only observing the connectivity or attribute information of the network, preserving the geometry of both types of information is challenging. A novel tensor factorization approach for node embedding in attributed networks is proposed in this paper, that preserves the distances of both the connections and the attributes. Furthermore, an effective and lightweight algorithm is developed to tackle the learning task and judicious experiments with multiple state-of-the-art baselines suggest that the proposed algorithm offers significant performance improvements in downstream tasks.
    Superpixel-based Knowledge Infusion in Deep Neural Networks for Image Classification. (arXiv:2105.09448v2 [cs.CV] UPDATED)
    Superpixels are higher-order perceptual groups of pixels in an image, often carrying much more information than the raw pixels. There is an inherent relational structure to the relationship among different superpixels of an image such as adjacent superpixels are neighbours of each other. Our interest here is to treat these relative positions of various superpixels as relational information of an image. This relational information can convey higher-order spatial information about the image, such as the relationship between superpixels representing two eyes in an image of a cat. That is, two eyes are placed adjacent to each other in a straight line or the mouth is below the nose. Our motive in this paper is to assist computer vision models, specifically those based on Deep Neural Networks (DNNs), by incorporating this higher-order information from superpixels. We construct a hybrid model that leverages (a) Convolutional Neural Network (CNN) to deal with spatial information in an image and (b) Graph Neural Network (GNN) to deal with relational superpixel information in the image. The proposed model is learned using a generic hybrid loss function. Our experiments are extensive, and we evaluate the predictive performance of our proposed hybrid vision model on seven different image classification datasets from a variety of domains such as digit and object recognition, biometrics, medical imaging. The results demonstrate that the relational superpixel information processed by a GNN can improve the performance of a standard CNN-based vision system.
    Laplacian Constrained Precision Matrix Estimation: Existence and High Dimensional Consistency. (arXiv:2111.00590v2 [stat.ML] UPDATED)
    This paper considers the problem of estimating high dimensional Laplacian constrained precision matrices by minimizing Stein's loss. We obtain a necessary and sufficient condition for existence of this estimator, that consists on checking whether a certain data dependent graph is connected. We also prove consistency in the high dimensional setting under the symmetrized Stein loss. We show that the error rate does not depend on the graph sparsity, or other type of structure, and that Laplacian constraints are sufficient for high dimensional consistency. Our proofs exploit properties of graph Laplacians, the matrix tree theorem, and a characterization of the proposed estimator based on effective graph resistances. We validate our theoretical claims with numerical experiments.
    Extension of Dynamic Mode Decomposition for dynamic systems with incomplete information based on t-model of optimal prediction. (arXiv:2202.11432v1 [math.NA])
    The Dynamic Mode Decomposition has proved to be a very efficient technique to study dynamic data. This is entirely a data-driven approach that extracts all necessary information from data snapshots which are commonly supposed to be sampled from measurement. The application of this approach becomes problematic if the available data is incomplete because some dimensions of smaller scale either missing or unmeasured. Such setting occurs very often in modeling complex dynamical systems such as power grids, in particular with reduced-order modeling. To take into account the effect of unresolved variables the optimal prediction approach based on the Mori-Zwanzig formalism can be applied to obtain the most expected prediction under existing uncertainties. This effectively leads to the development of a time-predictive model accounting for the impact of missing data. In the present paper we provide a detailed derivation of the considered method from the Liouville equation and finalize it with the optimization problem that defines the optimal transition operator corresponding to the observed data. In contrast to the existing approach, we consider a first-order approximation of the Mori-Zwanzig decomposition, state the corresponding optimization problem and solve it with the gradient-based optimization method. The gradient of the obtained objective function is computed precisely through the automatic differentiation technique. The numerical experiments illustrate that the considered approach gives practically the same dynamics as the exact Mori-Zwanzig decomposition, but is less computationally intensive.
    Bayesian Model Selection, the Marginal Likelihood, and Generalization. (arXiv:2202.11678v1 [cs.LG])
    How do we compare between hypotheses that are entirely consistent with observations? The marginal likelihood (aka Bayesian evidence), which represents the probability of generating our observations from a prior, provides a distinctive approach to this foundational question, automatically encoding Occam's razor. Although it has been observed that the marginal likelihood can overfit and is sensitive to prior assumptions, its limitations for hyperparameter learning and discrete model comparison have not been thoroughly investigated. We first revisit the appealing properties of the marginal likelihood for learning constraints and hypothesis testing. We then highlight the conceptual and practical issues in using the marginal likelihood as a proxy for generalization. Namely, we show how marginal likelihood can be negatively correlated with generalization, with implications for neural architecture search, and can lead to both underfitting and overfitting in hyperparameter learning. We provide a partial remedy through a conditional marginal likelihood, which we show is more aligned with generalization, and practically valuable for large-scale hyperparameter learning, such as in deep kernel learning.
    Amortized Causal Discovery: Learning to Infer Causal Graphs from Time-Series Data. (arXiv:2006.10833v3 [cs.LG] UPDATED)
    On time-series data, most causal discovery methods fit a new model whenever they encounter samples from a new underlying causal graph. However, these samples often share relevant information which is lost when following this approach. Specifically, different samples may share the dynamics which describe the effects of their causal relations. We propose Amortized Causal Discovery, a novel framework that leverages such shared dynamics to learn to infer causal relations from time-series data. This enables us to train a single, amortized model that infers causal relations across samples with different underlying causal graphs, and thus leverages the shared dynamics information. We demonstrate experimentally that this approach, implemented as a variational model, leads to significant improvements in causal discovery performance, and show how it can be extended to perform well under added noise and hidden confounding.
    Fast Density Estimation for Density-based Clustering Methods. (arXiv:2109.11383v2 [cs.LG] UPDATED)
    Density-based clustering algorithms are widely used for discovering clusters in pattern recognition and machine learning since they can deal with non-hyperspherical clusters and are robustness to handle outliers. However, the runtime of density-based algorithms are heavily dominated by finding fixed-radius near neighbors and calculating the density, which is time-consuming. Meanwhile, the traditional acceleration methods using indexing technique such as KD tree is not effective in processing high-dimensional data. In this paper, we propose a fast region query algorithm named fast principal component analysis pruning (called FPCAP) with the help of the fast principal component analysis technique in conjunction with geometric information provided by principal attributes of the data, which can process high-dimensional data and be easily applied to density-based methods to prune unnecessary distance calculations when finding neighbors and estimating densities. As an application in density-based clustering methods, FPCAP method was combined with the Density Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. And then, an improved DBSCAN (called IDBSCAN) is obtained, which preserves the advantage of DBSCAN and meanwhile, greatly reduces the computation of redundant distances. Experiments on seven benchmark datasets demonstrate that the proposed algorithm improves the computational efficiency significantly.
    Time-varying Graph Learning Under Structured Temporal Priors. (arXiv:2110.05018v2 [cs.LG] UPDATED)
    This paper endeavors to learn time-varying graphs by using structured temporal priors that assume underlying relations between arbitrary two graphs in the graph sequence. Different from many existing chain structure based methods in which the priors like temporal homogeneity can only describe the variations of two consecutive graphs, we propose a structure named \emph{temporal graph} to characterize the underlying real temporal relations. Under this framework, the chain structure is actually a special case of our temporal graph. We further proposed Alternating Direction Method of Multipliers (ADMM), a distributed algorithm, to solve the induced optimization problem. Numerical experiments demonstrate the superiorities of our method.
    Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF. (arXiv:2202.11479v1 [cs.SD])
    This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, a carefully regularized interpreter module is trained to take hidden layer representations of the targeted network as input and produce time activations of pre-learnt NMF components as intermediate outputs. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network's decision. We demonstrate our method's applicability on popular benchmarks, including a real-world multi-label classification task.
    FedAdapt: Adaptive Offloading for IoT Devices in Federated Learning. (arXiv:2107.04271v3 [cs.DC] UPDATED)
    Applying Federated Learning (FL) on Internet-of-Things devices is necessitated by the large volumes of data they produce and growing concerns of data privacy. However, there are three challenges that need to be addressed to make FL efficient: (i) execution on devices with limited computational capabilities, (ii) accounting for stragglers due to computational heterogeneity of devices, and (iii) adaptation to the changing network bandwidths. This paper presents FedAdapt, an adaptive offloading FL framework to mitigate the aforementioned challenges. FedAdapt accelerates local training in computationally constrained devices by leveraging layer offloading of deep neural networks (DNNs) to servers. Further, FedAdapt adopts reinforcement learning based optimization and clustering to adaptively identify which layers of the DNN should be offloaded for each individual device on to a server to tackle the challenges of computational heterogeneity and changing network bandwidth. Experimental studies are carried out on a lab-based testbed and it is demonstrated that by offloading a DNN from the device to the server FedAdapt reduces the training time of a typical IoT device by over half compared to classic FL. The training time of extreme stragglers and the overall training time can be reduced by up to 57%. Furthermore, with changing network bandwidth, FedAdapt is demonstrated to reduce the training time by up to 40% when compared to classic FL, without sacrificing accuracy.  ( 2 min )
    Meta-learning based Alternating Minimization Algorithm for Non-convex Optimization. (arXiv:2009.04899v6 [cs.LG] UPDATED)
    In this paper, we propose a novel solution for non-convex problems of multiple variables, especially for those typically solved by an alternating minimization (AM) strategy that splits the original optimization problem into a set of sub-problems corresponding to each variable, and then iteratively optimize each sub-problem using a fixed updating rule. However, due to the intrinsic non-convexity of the original optimization problem, the optimization can usually be trapped into spurious local minimum even when each sub-problem can be optimally solved at each iteration. Meanwhile, learning-based approaches, such as deep unfolding algorithms, are highly limited by the lack of labelled data and restricted explainability. To tackle these issues, we propose a meta-learning based alternating minimization (MLAM) method, which aims to minimize a partial of the global losses over iterations instead of carrying minimization on each sub-problem, and it tends to learn an adaptive strategy to replace the handcrafted counterpart resulting in advance on superior performance. Meanwhile, the proposed MLAM still maintains the original algorithmic principle, which contributes to a better interpretability. We evaluate the proposed method on two representative problems, namely, bi-linear inverse problem: matrix completion, and non-linear problem: Gaussian mixture models. The experimental results validate that our proposed approach outperforms AM-based methods in standard settings, and is able to achieve effective optimization in challenging cases while other comparing methods would typically fail.
    FastSHAP: Real-Time Shapley Value Estimation. (arXiv:2107.07436v2 [stat.ML] UPDATED)
    Shapley values are widely used to explain black-box models, but they are costly to calculate because they require many model evaluations. We introduce FastSHAP, a method for estimating Shapley values in a single forward pass using a learned explainer model. FastSHAP amortizes the cost of explaining many inputs via a learning approach inspired by the Shapley value's weighted least squares characterization, and it can be trained using standard stochastic gradient optimization. We compare FastSHAP to existing estimation approaches, revealing that it generates high-quality explanations with orders of magnitude speedup.
    FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning. (arXiv:2111.11556v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FLIX, that takes into account the unique challenges brought by federated learning. FLIX has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FLIX does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FLIX formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation.
    Weighted Programming. (arXiv:2202.07577v1 [cs.PL] CROSS LISTED)
    We study weighted programming, a programming paradigm for specifying mathematical models. More specifically, the weighted programs we investigate are like usual imperative programs with two additional features: (1) nondeterministic branching and (2) weighting execution traces. Weights can be numbers but also other objects like words from an alphabet, polynomials, formal power series, or cardinal numbers. We argue that weighted programming as a paradigm can be used to specify mathematical models beyond probability distributions (as is done in probabilistic programming). We develop weakest-precondition- and weakest-liberal-precondition-style calculi \`{a} la Dijkstra for reasoning about mathematical models specified by weighted programs. We present several case studies. For instance, we use weighted programming to model the ski rental problem - an optimization problem. We model not only the optimization problem itself, but also the best deterministic online algorithm for solving this problem as weighted programs. By means of weakest-precondition-style reasoning, we can determine the competitive ratio of the online algorithm on source code level.
    Wide Mean-Field Bayesian Neural Networks Ignore the Data. (arXiv:2202.11670v1 [cs.LG])
    Bayesian neural networks (BNNs) combine the expressive power of deep learning with the advantages of Bayesian formalism. In recent years, the analysis of wide, deep BNNs has provided theoretical insight into their priors and posteriors. However, we have no analogous insight into their posteriors under approximate inference. In this work, we show that mean-field variational inference entirely fails to model the data when the network width is large and the activation function is odd. Specifically, for fully-connected BNNs with odd activation functions and a homoscedastic Gaussian likelihood, we show that the optimal mean-field variational posterior predictive (i.e., function space) distribution converges to the prior predictive distribution as the width tends to infinity. We generalize aspects of this result to other likelihoods. Our theoretical results are suggestive of underfitting behavior previously observered in BNNs. While our convergence bounds are non-asymptotic and constants in our analysis can be computed, they are currently too loose to be applicable in standard training regimes. Finally, we show that the optimal approximate posterior need not tend to the prior if the activation function is not odd, showing that our statements cannot be generalized arbitrarily.
    Can Pretext-Based Self-Supervised Learning Be Boosted by Downstream Data? A Theoretical Analysis. (arXiv:2103.03568v4 [cs.LG] UPDATED)
    Pretext-based self-supervised learning learns the semantic representation via a handcrafted pretext task over unlabeled data and then uses the learned representation for downstream tasks, which effectively reduces the sample complexity of downstream tasks under Conditional Independence (CI) condition. However, the downstream sample complexity gets much worse if the CI condition does not hold. One interesting question is whether we can make the CI condition hold by using downstream data to refine the unlabeled data to boost self-supervised learning. At first glance, one might think that seeing downstream data in advance would always boost the downstream performance. However, we show that it is not intuitively true and point out that in some cases, it hurts the final performance instead. In particular, we prove both model-free and model-dependent lower bounds of the number of downstream samples used for data refinement. Moreover, we conduct various experiments on both synthetic and real-world datasets to verify our theoretical results.
    Label-Smoothed Backdoor Attack. (arXiv:2202.11203v1 [cs.CR])
    By injecting a small number of poisoned samples into the training set, backdoor attacks aim to make the victim model produce designed outputs on any input injected with pre-designed backdoors. In order to achieve a high attack success rate using as few poisoned training samples as possible, most existing attack methods change the labels of the poisoned samples to the target class. This practice often results in severe over-fitting of the victim model over the backdoors, making the attack quite effective in output control but easier to be identified by human inspection or automatic defense algorithms. In this work, we proposed a label-smoothing strategy to overcome the over-fitting problem of these attack methods, obtaining a \textit{Label-Smoothed Backdoor Attack} (LSBA). In the LSBA, the label of the poisoned sample $\bm{x}$ will be changed to the target class with a probability of $p_n(\bm{x})$ instead of 100\%, and the value of $p_n(\bm{x})$ is specifically designed to make the prediction probability the target class be only slightly greater than those of the other classes. Empirical studies on several existing backdoor attacks show that our strategy can considerably improve the stealthiness of these attacks and, at the same time, achieve a high attack success rate. In addition, our strategy makes it able to manually control the prediction probability of the design output through manipulating the applied and activated number of LSBAs\footnote{Source code will be published at \url{https://github.com/v-mipeng/LabelSmoothedAttack.git}}.
    Neural Generalised AutoRegressive Conditional Heteroskedasticity. (arXiv:2202.11285v1 [cs.LG])
    We propose Neural GARCH, a class of methods to model conditional heteroskedasticity in financial time series. Neural GARCH is a neural network adaptation of the GARCH 1,1 model in the univariate case, and the diagonal BEKK 1,1 model in the multivariate case. We allow the coefficients of a GARCH model to be time varying in order to reflect the constantly changing dynamics of financial markets. The time varying coefficients are parameterised by a recurrent neural network that is trained with stochastic gradient variational Bayes. We propose two variants of our model, one with normal innovations and the other with Students t innovations. We test our models on a wide range of univariate and multivariate financial time series, and we find that the Neural Students t model consistently outperforms the others.
    Active Learning for Human-in-the-Loop Customs Inspection. (arXiv:2010.14282v3 [cs.LG] UPDATED)
    We study the human-in-the-loop customs inspection scenario, where an AI-assisted algorithm supports customs officers by recommending a set of imported goods to be inspected. If the inspected items are fraudulent, the officers can levy extra duties. Th formed logs are then used as additional training data for successive iterations. Choosing to inspect suspicious items first leads to an immediate gain in customs revenue, yet such inspections may not bring new insights for learning dynamic traffic patterns. On the other hand, inspecting uncertain items can help acquire new knowledge, which will be used as a supplementary training resource to update the selection systems. Based on multiyear customs datasets obtained from three countries, we demonstrate that some degree of exploration is necessary to cope with domain shifts in trade data. The results show that a hybrid strategy of selecting likely fraudulent and uncertain items will eventually outperform the exploitation-only strategy.
    A Deep Reinforcement Learning Approach for the Meal Delivery Problem. (arXiv:2104.12000v2 [cs.LG] UPDATED)
    We consider a meal delivery service fulfilling dynamic customer requests given a set of couriers over the course of a day. A courier's duty is to pick-up an order from a restaurant and deliver it to a customer. We model this service as a Markov decision process and use deep reinforcement learning as the solution approach. We experiment with the resulting policies on synthetic and real-world datasets and compare those with the baseline policies. We also examine the courier utilization for different numbers of couriers. In our analysis, we specifically focus on the impact of the limited available resources in the meal delivery problem. Furthermore, we investigate the effect of intelligent order rejection and re-positioning of the couriers. Our numerical experiments show that, by incorporating the geographical locations of the restaurants, customers, and the depot, our model significantly improves the overall service quality as characterized by the expected total reward and the delivery times. Our results present valuable insights on both the courier assignment process and the optimal number of couriers for different order frequencies on a given day. The proposed model also shows a robust performance under a variety of scenarios for real-world implementation.
    CentSmoothie: Central-Smoothing Hypergraph Neural Networks for Predicting Drug-Drug Interactions. (arXiv:2112.07837v2 [cs.LG] UPDATED)
    Predicting drug-drug interactions (DDI) is the problem of predicting side effects (unwanted outcomes) of a pair of drugs using drug information and known side effects of many pairs. This problem can be formulated as predicting labels (i.e. side effects) for each pair of nodes in a DDI graph, of which nodes are drugs and edges are interacting drugs with known labels. State-of-the-art methods for this problem are graph neural networks (GNNs), which leverage neighborhood information in the graph to learn node representations. For DDI, however, there are many labels with complicated relationships due to the nature of side effects. Usual GNNs often fix labels as one-hot vectors that do not reflect label relationships and potentially do not obtain the highest performance in the difficult cases of infrequent labels. In this paper, we formulate DDI as a hypergraph where each hyperedge is a triple: two nodes for drugs and one node for a label. We then present CentSmoothie, a hypergraph neural network that learns representations of nodes and labels altogether with a novel central-smoothing formulation. We empirically demonstrate the performance advantages of CentSmoothie in simulations as well as real datasets.
    Classification of Computer Aided Engineering (CAE) Parts Using Graph Convolutional Networks. (arXiv:2202.11289v1 [cs.LG])
    CAE engineers work with hundreds of parts spread across multiple body models. A Graph Convolutional Network (GCN) was used to develop a CAE parts classifier. As many as 866 distinct parts from a representative body model were used as training data. The parts were represented as a three-dimensional (3-D) Finite Element Analysis (FEA) mesh with values of each node in the x, y, z coordinate system. The GCN based classifier was compared to fully connected neural network and PointNet based models. Performance of the trained models was evaluated with a test set that included parts from the training data, but with additional holes, rotation, translation, mesh refinement/coarsening, variation of mesh schema, mirroring along x and y axes, variation of topographical features, and change in mesh node ordering. The trained GCN model was able to achieve 88.5% classification accuracy on the test set i.e., it was able to find the correct matching part from the dataset of 866 parts despite significant variation from the baseline part. A CAE parts classifier demonstrated in this study could be very useful for engineers to filter through CAE parts spread across several body models to find parts that meet their requirements.
    CF-GNNExplainer: Counterfactual Explanations for Graph Neural Networks. (arXiv:2102.03322v4 [cs.LG] UPDATED)
    Given the increasing promise of graph neural networks (GNNs) in real-world applications, several methods have been developed for explaining their predictions. Existing methods for interpreting predictions from GNNs have primarily focused on generating subgraphs that are especially relevant for a particular prediction. However, such methods are not counterfactual (CF) in nature: given a prediction, we want to understand how the prediction can be changed in order to achieve an alternative outcome. In this work, we propose a method for generating CF explanations for GNNs: the minimal perturbation to the input (graph) data such that the prediction changes. Using only edge deletions, we find that our method, CF-GNNExplainer, can generate CF explanations for the majority of instances across three widely used datasets for GNN explanations, while removing less than 3 edges on average, with at least 94\% accuracy. This indicates that CF-GNNExplainer primarily removes edges that are crucial for the original predictions, resulting in minimal CF explanations.
    A2P-MANN: Adaptive Attention Inference Hops Pruned Memory-Augmented Neural Networks. (arXiv:2101.09693v2 [cs.CL] UPDATED)
    In this work, to limit the number of required attention inference hops in memory-augmented neural networks, we propose an online adaptive approach called A2P-MANN. By exploiting a small neural network classifier, an adequate number of attention inference hops for the input query is determined. The technique results in elimination of a large number of unnecessary computations in extracting the correct answer. In addition, to further lower computations in A2P-MANN, we suggest pruning weights of the final FC (fully-connected) layers. To this end, two pruning approaches, one with negligible accuracy loss and the other with controllable loss on the final accuracy, are developed. The efficacy of the technique is assessed by using the twenty question-answering (QA) tasks of bAbI dataset. The analytical assessment reveals, on average, more than 42% fewer computations compared to the baseline MANN at the cost of less than 1% accuracy loss. In addition, when used along with the previously published zero-skipping technique, a computation count reduction of up to 68% is achieved. Finally, when the proposed approach (without zero-skipping) is implemented on the CPU and GPU platforms, up to 43% runtime reduction is achieved.  ( 2 min )
    Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. (arXiv:2106.12672v3 [cs.CL] UPDATED)
    State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.  ( 2 min )
    Comparative analysis of machine learning methods for active flow control. (arXiv:2202.11664v1 [physics.flu-dyn])
    Machine learning frameworks such as Genetic Programming (GP) and Reinforcement Learning (RL) are gaining popularity in flow control. This work presents a comparative analysis of the two, bench-marking some of their most representative algorithms against global optimization techniques such as Bayesian Optimization (BO) and Lipschitz global optimization (LIPO). First, we review the general framework of the flow control problem, linking optimal control theory with model-free machine learning methods. Then, we test the control algorithms on three test cases. These are (1) the stabilization of a nonlinear dynamical system featuring frequency cross-talk, (2) the wave cancellation from a Burgers' flow and (3) the drag reduction in a cylinder wake flow. Although the control of these problems has been tackled in the recent literature with one method or the other, we present a comprehensive comparison to illustrate their differences in exploration versus exploitation and their balance between `model capacity' in the control law definition versus `required complexity'. We believe that such a comparison opens the path towards hybridization of the various methods, and we offer some perspective on their future development in the literature of flow control problems.  ( 2 min )
    A general sample complexity analysis of vanilla policy gradient. (arXiv:2107.11433v3 [cs.LG] UPDATED)
    We adapt recent tools developed for the analysis of Stochastic Gradient Descent (SGD) in non-convex optimization to obtain convergence and sample complexity guarantees for the vanilla policy gradient (PG). Our only assumptions are that the expected return is smooth w.r.t. the policy parameters, that its $H$-step truncated gradient is close to the exact gradient, and a certain ABC assumption. This assumption requires the second moment of the estimated gradient to be bounded by $A \geq 0$ times the suboptimality gap, $B \geq 0$ times the norm of the full batch gradient and an additive constant $C \geq 0$, or any combination of aforementioned. We show that the ABC assumption is more general than the commonly used assumptions on the policy space to prove convergence to a stationary point. We provide a single convergence theorem that recovers the $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity of PG. Our results also affords greater flexibility in the choice of hyper parameters such as the step size and places no restriction on the batch size $m$, including the single trajectory case (i.e., $m=1$). We then instantiate our theorem in different settings, where we both recover existing results and obtained improved sample complexity, e.g., for convergence to the global optimum for Fisher-non-degenerated parameterized policies.  ( 2 min )
    Networked Online Learning for Control of Safety-Critical Resource-Constrained Systems based on Gaussian Processes. (arXiv:2202.11491v1 [eess.SY])
    Safety-critical technical systems operating in unknown environments require the ability to quickly adapt their behavior, which can be achieved in control by inferring a model online from the data stream generated during operation. Gaussian process-based learning is particularly well suited for safety-critical applications as it ensures bounded prediction errors. While there exist computationally efficient approximations for online inference, these approaches lack guarantees for the prediction error and have high memory requirements, and are therefore not applicable to safety-critical systems with tight memory constraints. In this work, we propose a novel networked online learning approach based on Gaussian process regression, which addresses the issue of limited local resources by employing remote data management in the cloud. Our approach formally guarantees a bounded tracking error with high probability, which is exploited to identify the most relevant data to achieve a certain control performance. We further propose an effective data transmission scheme between the local system and the cloud taking bandwidth limitations and time delay of the transmission channel into account. The effectiveness of the proposed method is successfully demonstrated in a simulation.  ( 2 min )
    TEE-based decentralized recommender systems: The raw data sharing redemption. (arXiv:2202.11655v1 [cs.DC])
    Recommenders are central in many applications today. The most effective recommendation schemes, such as those based on collaborative filtering (CF), exploit similarities between user profiles to make recommendations, but potentially expose private data. Federated learning and decentralized learning systems address this by letting the data stay on user's machines to preserve privacy: each user performs the training on local data and only the model parameters are shared. However, sharing the model parameters across the network may still yield privacy breaches. In this paper, we present REX, the first enclave-based decentralized CF recommender. REX exploits Trusted execution environments (TEE), such as Intel software guard extensions (SGX), that provide shielded environments within the processor to improve convergence while preserving privacy. Firstly, REX enables raw data sharing, which ultimately speeds up convergence and reduces the network load. Secondly, REX fully preserves privacy. We analyze the impact of raw data sharing in both deep neural network (DNN) and matrix factorization (MF) recommenders and showcase the benefits of trusted environments in a full-fledged implementation of REX. Our experimental results demonstrate that through raw data sharing, REX significantly decreases the training time by 18.3x and the network load by 2 orders of magnitude over standard decentralized approaches that share only parameters, while fully protecting privacy by leveraging trustworthy hardware enclaves with very little overhead.  ( 2 min )
    Testing report of a fingerprint-based door-opening system. (arXiv:2202.11419v1 [cs.LG])
    This paper describes the operational evaluation of a door-opening system based on a low-cost inkless fingerprint sensor. This system has been developed and installed for access control to one of our laboratories. Experimental results reveal that the system is working fine and no special cleaning requirements neither components replacement is needed. It can support more than 50 users, and an average of 74,5 access attempts per day in a 14-hour 5-day-per-week working. Emphasize is also given on some important facts to be taken into consideration when comparing and evaluating different products from different vendors.  ( 2 min )
    A Bayesian Deep Learning Approach to Near-Term Climate Prediction. (arXiv:2202.11244v1 [physics.ao-ph])
    Since model bias and associated initialization shock are serious shortcomings that reduce prediction skills in state-of-the-art decadal climate prediction efforts, we pursue a complementary machine-learning-based approach to climate prediction. The example problem setting we consider consists of predicting natural variability of the North Atlantic sea surface temperature on the interannual timescale in the pre-industrial control simulation of the Community Earth System Model (CESM2). While previous works have considered the use of recurrent networks such as convolutional LSTMs and reservoir computing networks in this and other similar problem settings, we currently focus on the use of feedforward convolutional networks. In particular, we find that a feedforward convolutional network with a Densenet architecture is able to outperform a convolutional LSTM in terms of predictive skill. Next, we go on to consider a probabilistic formulation of the same network based on Stein variational gradient descent and find that in addition to providing useful measures of predictive uncertainty, the probabilistic (Bayesian) version improves on its deterministic counterpart in terms of predictive skill. Finally, we characterize the reliability of the ensemble of ML models obtained in the probabilistic setting by using analysis tools developed in the context of ensemble numerical weather prediction.  ( 2 min )
    Reconstruction of observed mechanical motions with Artificial Intelligence tools. (arXiv:2202.11447v1 [cs.LG])
    The goal of this paper is to determine the laws of observed trajectories assuming that there is a mechanical system in the background and using these laws to continue the observed motion in a plausible way. The laws are represented by neural networks with a limited number of parameters. The training of the networks follows the Extreme Learning Machine idea. We determine laws for different levels of embedding, thus we can represent not only the equation of motion but also the symmetries of different kinds. In the recursive numerical evolution of the system, we require the fulfillment of all the observed laws, within the determined numerical precision. In this way, we can successfully reconstruct both integrable and chaotic motions, as we demonstrate in the example of the gravity pendulum and the double pendulum.
    Blockchain Framework for Artificial Intelligence Computation. (arXiv:2202.11264v1 [cs.DC])
    Blockchain is an essentially distributed database recording all transactions or digital events among participating parties. Each transaction in the records is approved and verified by consensus of the participants in the system that requires solving a hard mathematical puzzle, which is known as proof-of-work. To make the approved records immutable, the mathematical puzzle is not trivial to solve and therefore consumes substantial computing resources. However, it is energy-wasteful to have many computational nodes installed in the blockchain competing to approve the records by just solving a meaningless puzzle. Here, we pose proof-of-work as a reinforcement-learning problem by modeling the blockchain growing as a Markov decision process, in which a learning agent makes an optimal decision over the environment's state, whereas a new block is added and verified. Specifically, we design the block verification and consensus mechanism as a deep reinforcement-learning iteration process. As a result, our method utilizes the determination of state transition and the randomness of action selection of a Markov decision process, as well as the computational complexity of a deep neural network, collectively to make the blocks not easy to recompute and to preserve the order of transactions, while the blockchain nodes are exploited to train the same deep neural network with different data samples (state-action pairs) in parallel, allowing the model to experience multiple episodes across computing nodes but at one time. Our method is used to design the next generation of public blockchain networks, which has the potential not only to spare computational resources for industrial applications but also to encourage data sharing and AI model design for common problems.
    Deep Reinforcement Learning: Opportunities and Challenges. (arXiv:2202.11296v1 [cs.LG])
    This article is a gentle discussion about the field of reinforcement learning for real life, about opportunities and challenges, with perspectives and without technical details, touching a broad range of topics. The article is based on both historical and recent research papers, surveys, tutorials, talks, blogs, and books. Various groups of readers, like researchers, engineers, students, managers, investors, officers, and people wanting to know more about the field, may find the article interesting. In this article, we first give a brief introduction to reinforcement learning (RL), and its relationship with deep learning, machine learning and AI. Then we discuss opportunities of RL, in particular, applications in products and services, games, recommender systems, robotics, transportation, economics and finance, healthcare, education, combinatorial optimization, computer systems, and science and engineering. The we discuss challenges, in particular, 1) foundation, 2) representation, 3) reward, 4) model, simulation, planning, and benchmarks, 5) learning to learn a.k.a. meta-learning, 6) off-policy/offline learning, 7) software development and deployment, 8) business perspectives, and 9) more challenges. We conclude with a discussion, attempting to answer: "Why has RL not been widely adopted in practice yet?" and "When is RL helpful?".
    LPF-Defense: 3D Adversarial Defense based on Frequency Analysis. (arXiv:2202.11287v1 [cs.CV])
    Although 3D point cloud classification has recently been widely deployed in different application scenarios, it is still very vulnerable to adversarial attacks. This increases the importance of robust training of 3D models in the face of adversarial attacks. Based on our analysis on the performance of existing adversarial attacks, more adversarial perturbations are found in the mid and high-frequency components of input data. Therefore, by suppressing the high-frequency content in the training phase, the models robustness against adversarial examples is improved. Experiments showed that the proposed defense method decreases the success rate of six attacks on PointNet, PointNet++ ,, and DGCNN models. In particular, improvements are achieved with an average increase of classification accuracy by 3.8 % on drop100 attack and 4.26 % on drop200 attack compared to the state-of-the-art methods. The method also improves models accuracy on the original dataset compared to other available methods.
    Differentially Private Regression with Unbounded Covariates. (arXiv:2202.11199v1 [cs.CR])
    We provide computationally efficient, differentially private algorithms for the classical regression settings of Least Squares Fitting, Binary Regression and Linear Regression with unbounded covariates. Prior to our work, privacy constraints in such regression settings were studied under strong a priori bounds on covariates. We consider the case of Gaussian marginals and extend recent differentially private techniques on mean and covariance estimation (Kamath et al., 2019; Karwa and Vadhan, 2018) to the sub-gaussian regime. We provide a novel technical analysis yielding differentially private algorithms for the above classical regression settings. Through the case of Binary Regression, we capture the fundamental and widely-studied models of logistic regression and linearly-separable SVMs, learning an unbiased estimate of the true regression vector, up to a scaling factor.
    Deep Bayesian ICP Covariance Estimation. (arXiv:2202.11607v1 [cs.RO])
    Covariance estimation for the Iterative Closest Point (ICP) point cloud registration algorithm is essential for state estimation and sensor fusion purposes. We argue that a major source of error for ICP is in the input data itself, from the sensor noise to the scene geometry. Benefiting from recent developments in deep learning for point clouds, we propose a data-driven approach to learn an error model for ICP. We estimate covariances modeling data-dependent heteroscedastic aleatoric uncertainty, and epistemic uncertainty using a variational Bayesian approach. The system evaluation is performed on LiDAR odometry on different datasets, highlighting good results in comparison to the state of the art.
    Backdoor Defense in Federated Learning Using Differential Testing and Outlier Detection. (arXiv:2202.11196v1 [cs.CR])
    The goal of federated learning (FL) is to train one global model by aggregating model parameters updated independently on edge devices without accessing users' private data. However, FL is susceptible to backdoor attacks where a small fraction of malicious agents inject a targeted misclassification behavior in the global model by uploading polluted model updates to the server. In this work, we propose DifFense, an automated defense framework to protect an FL system from backdoor attacks by leveraging differential testing and two-step MAD outlier detection, without requiring any previous knowledge of attack scenarios or direct access to local model parameters. We empirically show that our detection method prevents a various number of potential attackers while consistently achieving the convergence of the global model comparable to that trained under federated averaging (FedAvg). We further corroborate the effectiveness and generalizability of our method against prior defense techniques, such as Multi-Krum and coordinate-wise median aggregation. Our detection method reduces the average backdoor accuracy of the global model to below 4% and achieves a false negative rate of zero.
    Multivariate Quantile Function Forecaster. (arXiv:2202.11316v1 [cs.LG])
    We propose Multivariate Quantile Function Forecaster (MQF$^2$), a global probabilistic forecasting method constructed using a multivariate quantile function and investigate its application to multi-horizon forecasting. Prior approaches are either autoregressive, implicitly capturing the dependency structure across time but exhibiting error accumulation with increasing forecast horizons, or multi-horizon sequence-to-sequence models, which do not exhibit error accumulation, but also do typically not model the dependency structure across time steps. MQF$^2$ combines the benefits of both approaches, by directly making predictions in the form of a multivariate quantile function, defined as the gradient of a convex function which we parametrize using input-convex neural networks. By design, the quantile function is monotone with respect to the input quantile levels and hence avoids quantile crossing. We provide two options to train MQF$^2$: with energy score or with maximum likelihood. Experimental results on real-world and synthetic datasets show that our model has comparable performance with state-of-the-art methods in terms of single time step metrics while capturing the time dependency structure.  ( 2 min )
    Cooperative Behavioral Planning for Automated Driving using Graph Neural Networks. (arXiv:2202.11376v1 [cs.RO])
    Urban intersections are prone to delays and inefficiencies due to static precedence rules and occlusions limiting the view on prioritized traffic. Existing approaches to improve traffic flow, widely known as automatic intersection management systems, are mostly based on non-learning reservation schemes or optimization algorithms. Machine learning-based techniques show promising results in planning for a single ego vehicle. This work proposes to leverage machine learning algorithms to optimize traffic flow at urban intersections by jointly planning for multiple vehicles. Learning-based behavior planning poses several challenges, demanding for a suited input and output representation as well as large amounts of ground-truth data. We address the former issue by using a flexible graph-based input representation accompanied by a graph neural network. This allows to efficiently encode the scene and inherently provide individual outputs for all involved vehicles. To learn a sensible policy, without relying on the imitation of expert demonstrations, the cooperative planning task is phrased as a reinforcement learning problem. We train and evaluate the proposed method in an open-source simulation environment for decision making in automated driving. Compared to a first-in-first-out scheme and traffic governed by static priority rules, the learned planner shows a significant gain in flow rate, while reducing the number of induced stops. In addition to synthetic simulations, the approach is also evaluated based on real-world traffic data taken from the publicly available inD dataset.  ( 2 min )
    ChemRL-GEM: Geometry Enhanced Molecular Representation Learning for Property Prediction. (arXiv:2106.06130v4 [cs.LG] UPDATED)
    Effective molecular representation learning is of great importance to facilitate molecular property prediction, which is a fundamental task for the drug and material industry. Recent advances in graph neural networks (GNNs) have shown great promise in applying GNNs for molecular representation learning. Moreover, a few recent studies have also demonstrated successful applications of self-supervised learning methods to pre-train the GNNs to overcome the problem of insufficient labeled molecules. However, existing GNNs and pre-training strategies usually treat molecules as topological graph data without fully utilizing the molecular geometry information. Whereas, the three-dimensional (3D) spatial structure of a molecule, a.k.a molecular geometry, is one of the most critical factors for determining molecular physical, chemical, and biological properties. To this end, we propose a novel Geometry Enhanced Molecular representation learning method (GEM) for Chemical Representation Learning (ChemRL). At first, we design a geometry-based GNN architecture that simultaneously models atoms, bonds, and bond angles in a molecule. To be specific, we devised double graphs for a molecule: The first one encodes the atom-bond relations; The second one encodes bond-angle relations. Moreover, on top of the devised GNN architecture, we propose several novel geometry-level self-supervised learning strategies to learn spatial knowledge by utilizing the local and global molecular 3D structures. We compare ChemRL-GEM with various state-of-the-art (SOTA) baselines on different molecular benchmarks and exhibit that ChemRL-GEM can significantly outperform all baselines in both regression and classification tasks. For example, the experimental results show an overall improvement of 8.8% on average compared to SOTA baselines on the regression tasks, demonstrating the superiority of the proposed method.  ( 3 min )
    The Larger The Fairer? Small Neural Networks Can Achieve Fairness for Edge Devices. (arXiv:2202.11317v1 [cs.LG])
    Along with the progress of AI democratization, neural networks are being deployed more frequently in edge devices for a wide range of applications. Fairness concerns gradually emerge in many applications, such as face recognition and mobile medical. One fundamental question arises: what will be the fairest neural architecture for edge devices? By examining the existing neural networks, we observe that larger networks typically are fairer. But, edge devices call for smaller neural architectures to meet hardware specifications. To address this challenge, this work proposes a novel Fairness- and Hardware-aware Neural architecture search framework, namely FaHaNa. Coupled with a model freezing approach, FaHaNa can efficiently search for neural networks with balanced fairness and accuracy, while guaranteed to meet hardware specifications. Results show that FaHaNa can identify a series of neural networks with higher fairness and accuracy on a dermatology dataset. Target edge devices, FaHaNa finds a neural architecture with slightly higher accuracy, 5.28x smaller size, 15.14% higher fairness score, compared with MobileNetV2; meanwhile, on Raspberry PI and Odroid XU-4, it achieves 5.75x and 5.79x speedup.
    Neural Collaborative Filtering Bandits via Meta Learning. (arXiv:2201.13395v2 [cs.LG] UPDATED)
    Contextual multi-armed bandits provide powerful tools to solve the exploitation-exploration dilemma in decision making, with direct applications in the personalized recommendation. In fact, collaborative effects among users carry the significant potential to improve the recommendation. In this paper, we introduce and study the problem by exploring `Neural Collaborative Filtering Bandits', where the rewards can be non-linear functions and groups are formed dynamically given different specific contents. To solve this problem, inspired by meta-learning, we propose Meta-Ban (meta-bandits), where a meta-learner is designed to represent and rapidly adapt to dynamic groups, along with a UCB-based exploration strategy. Furthermore, we analyze that Meta-Ban can achieve the regret bound of $\mathcal{O}(\sqrt{T \log T})$, improving a multiplicative factor $\sqrt{\log T}$ over state-of-the-art related works. In the end, we conduct extensive experiments showing that Meta-Ban significantly outperforms six strong baselines.
    Pricing options on flow forwards by neural networks in Hilbert space. (arXiv:2202.11606v1 [q-fin.PR])
    We propose a new methodology for pricing options on flow forwards by applying infinite-dimensional neural networks. We recast the pricing problem as an optimization problem in a Hilbert space of real-valued function on the positive real line, which is the state space for the term structure dynamics. This optimization problem is solved by facilitating a novel feedforward neural network architecture designed for approximating continuous functions on the state space. The proposed neural net is built upon the basis of the Hilbert space. We provide an extensive case study that shows excellent numerical efficiency, with superior performance over that of a classical neural net trained on sampling the term structure curves.
    Traversing Time with Multi-Resolution Gaussian Process State-Space Models. (arXiv:2112.03230v2 [cs.LG] UPDATED)
    Gaussian Process state-space models capture complex temporal dependencies in a principled manner by placing a Gaussian Process prior on the transition function. These models have a natural interpretation as discretized stochastic differential equations, but inference for long sequences with fast and slow transitions is difficult. Fast transitions need tight discretizations whereas slow transitions require backpropagating the gradients over long subtrajectories. We propose a novel Gaussian process state-space architecture composed of multiple components, each trained on a different resolution, to model effects on different timescales. The combined model allows traversing time on adaptive scales, providing efficient inference for arbitrarily long sequences with complex dynamics. We benchmark our novel method on semi-synthetic data and on an engine modeling task. In both experiments, our approach compares favorably against its state-of-the-art alternatives that operate on a single time-scale only.
    Thermal hand image segmentation for biometric recognition. (arXiv:2202.11462v1 [cs.CV])
    In this paper we present a method to identify people by means of thermal (TH) and visible (VIS) hand images acquired simultaneously with a TESTO 882-3 camera. In addition, we also present a new database specially acquired for this work. The real challenge when dealing with TH images is the cold finger areas, which can be confused with the acquisition surface. This problem is solved by taking advantage of the VIS information. We have performed different tests to show how TH and VIS images work in identification problems. Experimental results reveal that TH hand image is as suitable for biometric recognition systems as VIS hand images, and better results are obtained when combining this information. A Biometric Dispersion Matcher has been used as a feature vector dimensionality reduction technique as well as a classification task. Its selection criteria helps to reduce the length of the vectors used to perform identification up to a hundred measurements. Identification rates reach a maximum value of 98.3% under these conditions, when using a database of 104 people.
    Residual Bootstrap Exploration for Stochastic Linear Bandit. (arXiv:2202.11474v1 [stat.ML])
    We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next step reward by re-sampling the residuals of mean reward estimate. Our algorithm, residual bootstrap exploration for stochastic linear bandit (\texttt{LinReBoot}), estimates the linear reward from its re-sampling distribution and pulls the arm with the highest reward estimate. In particular, we contribute a theoretical framework to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems. The key insight is that the strength of bootstrap exploration is based on collaborated optimism between the online-learned model and the re-sampling distribution of residuals. Such observation enables us to show that the proposed \texttt{LinReBoot} secure a high-probability $\tilde{O}(d \sqrt{n})$ sub-linear regret under mild conditions. Our experiments support the easy generalizability of the \texttt{ReBoot} principle in the various formulations of linear bandit problems and show the significant computational efficiency of \texttt{LinReBoot}.
    A Comparative Study of Deep Reinforcement Learning-based Transferable Energy Management Strategies for Hybrid Electric Vehicles. (arXiv:2202.11514v1 [cs.LG])
    The deep reinforcement learning-based energy management strategies (EMS) has become a promising solution for hybrid electric vehicles (HEVs). When driving cycles are changed, the network will be retrained, which is a time-consuming and laborious task. A more efficient way of choosing EMS is to combine deep reinforcement learning (DRL) with transfer learning, which can transfer knowledge of one domain to the other new domain, making the network of the new domain reach convergence values quickly. Different exploration methods of RL, including adding action space noise and parameter space noise, are compared against each other in the transfer learning process in this work. Results indicate that the network added parameter space noise is more stable and faster convergent than the others. In conclusion, the best exploration method for transferable EMS is to add noise in the parameter space, while the combination of action space noise and parameter space noise generally performs poorly. Our code is available at https://github.com/BIT-XJY/RL-based-Transferable-EMS.git.
    Contrastive Pre-training for Imbalanced Corporate Credit Ratings. (arXiv:2102.12580v2 [cs.LG] UPDATED)
    Corporate credit rating reflects the level of corporate credit and plays a crucial role in modern financial risk control. But real-world credit rating data usually shows long-tail distributions, which means heavy class imbalanced problem challenging the corporate credit rating system greatly. To tackle that, inspried by the recent advances of pre-train techniques in self-supervised representation learning, we propose a novel framework named Contrastive Pre-training for Corporate Credit Rating (CP4CCR), which utilizes the self-surpervision for getting over class imbalance. Specifically, we propose to, in the first phase, exert constrastive self-superivised pre-training without label information, which want to learn a better class-agnostic initialization. During this phase, two self-supervised task are developed within CP4CCR: (i) Feature Masking (FM) and (ii) Feature Swapping(FS). In the second phase, we can train any standard corporate redit rating model initialized by the pre-trained network. Extensive experiments conducted on the Chinese public-listed corporate rating dataset, prove that CP4CCR can improve the performance of standard corporate credit rating models, especially for class with few samples.
    Exploratory Methods for Relation Discovery in Archival Data. (arXiv:2202.11361v1 [cs.LG])
    In this article we propose a holistic approach to discover relations in art historical communities and enrich historians' biographies and archival descriptions with graph patterns relevant to art historiographic enquiry. We use exploratory data analysis to detect patterns, we select features, and we use them to evaluate classification models to predict new relations, to be recommended to archivists during the cataloguing phase. Results show that relations based on biographical information can be addressed with higher precision than relations based on research topics or institutional relations. Deterministic and a priori rules present better results than probabilistic methods.
    Fairness-Aware Naive Bayes Classifier for Data with Multiple Sensitive Features. (arXiv:2202.11499v1 [cs.LG])
    Fairness-aware machine learning seeks to maximise utility in generating predictions while avoiding unfair discrimination based on sensitive attributes such as race, sex, religion, etc. An important line of work in this field is enforcing fairness during the training step of a classifier. A simple yet effective binary classification algorithm that follows this strategy is two-naive-Bayes (2NB), which enforces statistical parity - requiring that the groups comprising the dataset receive positive labels with the same likelihood. In this paper, we generalise this algorithm into N-naive-Bayes (NNB) to eliminate the simplification of assuming only two sensitive groups in the data and instead apply it to an arbitrary number of groups. We propose an extension of the original algorithm's statistical parity constraint and the post-processing routine that enforces statistical independence of the label and the single sensitive attribute. Then, we investigate its application on data with multiple sensitive features and propose a new constraint and post-processing routine to enforce differential fairness, an extension of established group-fairness constraints focused on intersectionalities. We empirically demonstrate the effectiveness of the NNB algorithm on US Census datasets and compare its accuracy and debiasing performance, as measured by disparate impact and DF-$\epsilon$ score, with similar group-fairness algorithms. Finally, we lay out important considerations users should be aware of before incorporating this algorithm into their application, and direct them to further reading on the pros, cons, and ethical implications of using statistical parity as a fairness criterion.
    Short-answer scoring with ensembles of pretrained language models. (arXiv:2202.11558v1 [cs.CL])
    We investigate the effectiveness of ensembles of pretrained transformer-based language models on short answer questions using the Kaggle Automated Short Answer Scoring dataset. We fine-tune a collection of popular small, base, and large pretrained transformer-based language models, and train one feature-base model on the dataset with the aim of testing ensembles of these models. We used an early stopping mechanism and hyperparameter optimization in training. We observe that generally that the larger models perform slightly better, however, they still fall short of state-of-the-art results one their own. Once we consider ensembles of models, there are ensembles of a number of large networks that do produce state-of-the-art results, however, these ensembles are too large to realistically be put in a production environment.
    Cyclical Variational Bayes Monte Carlo for Efficient Multi-Modal Posterior Distributions Evaluation. (arXiv:2202.11645v1 [stat.CO])
    Statistical model updating is frequently used in engineering to calculate the uncertainty of some unknown latent parameters when a set of measurements on observable quantities is given. Variational inference is an alternative approach to sampling methods that has been developed by the machine learning community to estimate posterior approximations through an optimization approach. In this paper, the Variational Bayesian Monte Carlo (VBMC) method is investigated with the purpose of dealing with statistical model updating problems in engineering involving expensive-to-run models. This method combines the active-sampling Bayesian quadrature with a Gaussian-process based variational inference to yield a non-parametric estimation of the posterior distribution of the identified parameters involving few runs of the expensive-to-run model. VBMC can also be used for model selection as it produces an estimation of the model's evidence lower bound. In this paper, a variant of the VBMC algorithm is developed through the introduction of a cyclical annealing schedule into the algorithm. The proposed cyclical VBMC algorithm allows to deal effectively with multi-modal posteriors by having multiple cycles of exploration and exploitation phases. Four numerical examples are used to compare the standard VBMC algorithm, the monotonic VBMC, the cyclical VBMC and the Transitional Ensemble Markov Chain Monte Carlo (TEMCMC). Overall, it is found that the proposed cyclical VBMC approach yields accurate results with a very reduced number of model runs compared to the state of the art sampling technique TEMCMC. In the presence of potential multi-modal problems, the proposed cyclical VBMC algorithm outperforms all the other approaches in terms of accuracy of the resulting posterior.
    Learning Fast and Slow for Online Time Series Forecasting. (arXiv:2202.11672v1 [cs.LG])
    The fast adaptation capability of deep neural networks in non-stationary environments is critical for online time series forecasting. Successful solutions require handling changes to new and recurring patterns. However, training deep neural forecaster on the fly is notoriously challenging because of their limited ability to adapt to non-stationary environments and the catastrophic forgetting of old knowledge. In this work, inspired by the Complementary Learning Systems (CLS) theory, we propose Fast and Slow learning Networks (FSNet), a holistic framework for online time-series forecasting to simultaneously deal with abrupt changing and repeating patterns. Particularly, FSNet improves the slowly-learned backbone by dynamically balancing fast adaptation to recent changes and retrieving similar old knowledge. FSNet achieves this mechanism via an interaction between two complementary components of an adapter to monitor each layer's contribution to the lost, and an associative memory to support remembering, updating, and recalling repeating events. Extensive experiments on real and synthetic datasets validate FSNet's efficacy and robustness to both new and recurring patterns. Our code will be made publicly available.
    Exponential Tail Local Rademacher Complexity Risk Bounds Without the Bernstein Condition. (arXiv:2202.11461v1 [math.ST])
    The local Rademacher complexity framework is one of the most successful general-purpose toolboxes for establishing sharp excess risk bounds for statistical estimators based on the framework of empirical risk minimization. Applying this toolbox typically requires using the Bernstein condition, which often restricts applicability to convex and proper settings. Recent years have witnessed several examples of problems where optimal statistical performance is only achievable via non-convex and improper estimators originating from aggregation theory, including the fundamental problem of model selection. These examples are currently outside of the reach of the classical localization theory. In this work, we build upon the recent approach to localization via offset Rademacher complexities, for which a general high-probability theory has yet to be established. Our main result is an exponential-tail excess risk bound expressed in terms of the offset Rademacher complexity that yields results at least as sharp as those obtainable via the classical theory. However, our bound applies under an estimator-dependent geometric condition (the "offset condition") instead of the estimator-independent (but, in general, distribution-dependent) Bernstein condition on which the classical theory relies. Our results apply to improper prediction regimes not directly covered by the classical theory.
    On PAC-Bayesian reconstruction guarantees for VAEs. (arXiv:2202.11455v1 [cs.LG])
    Despite its wide use and empirical successes, the theoretical understanding and study of the behaviour and performance of the variational autoencoder (VAE) have only emerged in the past few years. We contribute to this recent line of work by analysing the VAE's reconstruction ability for unseen test data, leveraging arguments from the PAC-Bayes theory. We provide generalisation bounds on the theoretical reconstruction error, and provide insights on the regularisation effect of VAE objectives. We illustrate our theoretical results with supporting experiments on classical benchmark datasets.  ( 2 min )
    NetRCA: An Effective Network Fault Cause Localization Algorithm. (arXiv:2202.11269v1 [cs.LG])
    Localizing the root cause of network faults is crucial to network operation and maintenance. However, due to the complicated network architectures and wireless environments, as well as limited labeled data, accurately localizing the true root cause is challenging. In this paper, we propose a novel algorithm named NetRCA to deal with this problem. Firstly, we extract effective derived features from the original raw data by considering temporal, directional, attribution, and interaction characteristics. Secondly, we adopt multivariate time series similarity and label propagation to generate new training data from both labeled and unlabeled data to overcome the lack of labeled samples. Thirdly, we design an ensemble model which combines XGBoost, rule set learning, attribution model, and graph algorithm, to fully utilize all data information and enhance performance. Finally, experiments and analysis are conducted on the real-world dataset from ICASSP 2022 AIOps Challenge to demonstrate the superiority and effectiveness of our approach.  ( 2 min )
    Low-Regret Active learning. (arXiv:2104.02822v3 [cs.LG] UPDATED)
    We develop an online learning algorithm for identifying unlabeled data points that are most informative for training (i.e., active learning). By formulating the active learning problem as the prediction with sleeping experts problem, we provide a regret minimization framework for identifying relevant data with respect to any given definition of informativeness. Motivated by the successes of ensembles in active learning, we define regret with respect to an omnipotent algorithm that has access to an infinity large ensemble. At the core of our work is an efficient algorithm for sleeping experts that is tailored to achieve low regret on easy instances while remaining resilient to adversarial ones. Low regret implies that we can be provably competitive with an ensemble method \emph{without the computational burden of having to train an ensemble}. This stands in contrast to state-of-the-art active learning methods that are overwhelmingly based on greedy selection, and hence cannot ensure good performance across problem instances with high amounts of noise. We present empirical results demonstrating that our method (i) instantiated with an informativeness measure consistently outperforms its greedy counterpart and (ii) reliably outperforms uniform sampling on real-world scenarios.  ( 2 min )
    Enhancing Sentence Embedding with Generalized Pooling. (arXiv:1806.09828v1 [cs.CL] CROSS LISTED)
    Pooling is an essential component of a wide variety of sentence representation and embedding models. This paper explores generalized pooling methods to enhance sentence embedding. We propose vector-based multi-head attention that includes the widely used max pooling, mean pooling, and scalar self-attention as special cases. The model benefits from properly designed penalization terms to reduce redundancy in multi-head attention. We evaluate the proposed model on three different tasks: natural language inference (NLI), author profiling, and sentiment classification. The experiments show that the proposed model achieves significant improvement over strong sentence-encoding-based methods, resulting in state-of-the-art performances on four datasets. The proposed approach can be easily implemented for more problems than we discuss in this paper.  ( 2 min )
    Parallel MCMC Without Embarrassing Failures. (arXiv:2202.11154v1 [stat.ML])
    Embarrassingly parallel Markov Chain Monte Carlo (MCMC) exploits parallel computing to scale Bayesian inference to large datasets by using a two-step approach. First, MCMC is run in parallel on (sub)posteriors defined on data partitions. Then, a server combines local results. While efficient, this framework is very sensitive to the quality of subposterior sampling. Common sampling problems such as missing modes or misrepresentation of low-density regions are amplified -- instead of being corrected -- in the combination phase, leading to catastrophic failures. In this work, we propose a novel combination strategy to mitigate this issue. Our strategy, Parallel Active Inference (PAI), leverages Gaussian Process (GP) surrogate modeling and active learning. After fitting GPs to subposteriors, PAI (i) shares information between GP surrogates to cover missing modes; and (ii) uses active sampling to individually refine subposterior approximations. We validate PAI in challenging benchmarks, including heavy-tailed and multi-modal posteriors and a real-world application to computational neuroscience. Empirical results show that PAI succeeds where previous methods catastrophically fail, with a small communication overhead.
    Deep Metric Learning-Based Semi-Supervised Regression With Alternate Learning. (arXiv:2202.11388v1 [cs.CV])
    This paper introduces a novel deep metric learning-based semi-supervised regression (DML-S2R) method for parameter estimation problems. The proposed DML-S2R method aims to mitigate the problems of insufficient amount of labeled samples without collecting any additional samples with target values. To this end, the proposed DML-S2R method is made up of two main steps: i) pairwise similarity modeling with scarce labeled data; and ii) triplet-based metric learning with abundant unlabeled data. The first step aims to model pairwise sample similarities by using a small number of labeled samples. This is achieved by estimating the target value differences of labeled samples with a Siamese neural network (SNN). The second step aims to learn a triplet-based metric space (in which similar samples are close to each other and dissimilar samples are far apart from each other) when the number of labeled samples is insufficient. This is achieved by employing the SNN of the first step for triplet-based deep metric learning that exploits not only labeled samples but also unlabeled samples. For the end-to-end training of DML-S2R, we investigate an alternate learning strategy for the two steps. Due to this strategy, the encoded information in each step becomes a guidance for learning the other step. The experimental results confirm the success of DML-S2R compared to the state-of-the-art semi-supervised regression methods. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/DML-S2R.
    Augmentation based unsupervised domain adaptation. (arXiv:2202.11486v1 [eess.IV])
    The insertion of deep learning in medical image analysis had lead to the development of state-of-the art strategies in several applications such a disease classification, as well as abnormality detection and segmentation. However, even the most advanced methods require a huge and diverse amount of data to generalize. Because in realistic clinical scenarios, data acquisition and annotation is expensive, deep learning models trained on small and unrepresentative data tend to outperform when deployed in data that differs from the one used for training (e.g data from different scanners). In this work, we proposed a domain adaptation methodology to alleviate this problem in segmentation models. Our approach takes advantage of the properties of adversarial domain adaptation and consistency training to achieve more robust adaptation. Using two datasets with white matter hyperintensities (WMH) annotations, we demonstrated that the proposed method improves model generalization even in corner cases where individual strategies tend to fail.
    Arbitrary Shape Text Detection using Transformers. (arXiv:2202.11221v1 [cs.CV])
    Recent text detection frameworks require several handcrafted components such as anchor generation, non-maximum suppression (NMS), or multiple processing stages (e.g. label generation) to detect arbitrarily shaped text images. In contrast, we propose an end-to-end trainable architecture based on Detection using Transformers (DETR), that outperforms previous state-of-the-art methods in arbitrary-shaped text detection. At its core, our proposed method leverages a bounding box loss function that accurately measures the arbitrary detected text regions' changes in scale and aspect ratio. This is possible due to a hybrid shape representation made from Bezier curves, that are further split into piece-wise polygons. The proposed loss function is then a combination of a generalized-split-intersection-over-union loss defined over the piece-wise polygons and regularized by a Smooth-$\ln$ regression over the Bezier curve's control points. We evaluate our proposed model using Total-Text and CTW-1500 datasets for curved text, and MSRA-TD500 and ICDAR15 datasets for multi-oriented text, and show that the proposed method outperforms the previous state-of-the-art methods in arbitrary-shape text detection tasks.  ( 2 min )
    GradMax: Growing Neural Networks using Gradient Information. (arXiv:2201.05125v2 [cs.LG] UPDATED)
    The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights and find the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in variety of vision tasks and architectures.  ( 2 min )
    Indiscriminate Poisoning Attacks on Unsupervised Contrastive Learning. (arXiv:2202.11202v1 [cs.LG])
    Indiscriminate data poisoning attacks are quite effective against supervised learning. However, not much is known about their impact on unsupervised contrastive learning (CL). This paper is the first to consider indiscriminate data poisoning attacks on contrastive learning, demonstrating the feasibility of such attacks, and their differences from indiscriminate poisoning of supervised learning. We also highlight differences between contrastive learning algorithms, and show that some algorithms (e.g., SimCLR) are more vulnerable than others (e.g., MoCo). We differentiate between two types of data poisoning attacks: sample-wise attacks, which add specific noise to each image, cause the largest drop in accuracy, but do not transfer well across SimCLR, MoCo, and BYOL. In contrast, attacks that use class-wise noise, though cause a smaller drop in accuracy, transfer well across different CL algorithms. Finally, we show that a new data augmentation based on matrix completion can be highly effective in countering data poisoning attacks on unsupervised contrastive learning.
    Early Stage Diabetes Prediction via Extreme Learning Machine. (arXiv:2202.11216v1 [cs.LG])
    Diabetes is one of the chronic diseases that has been discovered for decades. However, several cases are diagnosed in their late stages. Every one in eleven of the world's adult population has diabetes. Forty-six percent of people with diabetes have not been diagnosed. Diabetes can develop several other severe diseases that can lead to patient death. Developing and rural areas suffer the most due to the limited medical providers and financial situations. This paper proposed a novel approach based on an extreme learning machine for diabetes prediction based on a data questionnaire that can early alert the users to seek medical assistance and prevent late diagnoses and severe illness development.
    Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms. (arXiv:2202.11277v1 [cs.IT])
    We consider the problem of quantizing a linear model learned from measurements $\mathbf{X} = \mathbf{W}\boldsymbol{\theta} + \mathbf{v}$. The model is constrained to be representable using only $dB$-bits, where $B \in (0, \infty)$ is a pre-specified budget and $d$ is the dimension of the model. We derive an information-theoretic lower bound for the minimax risk under this setting and show that it is tight with a matching upper bound. This upper bound is achieved using randomized embedding based algorithms. We propose randomized Hadamard embeddings that are computationally efficient while performing near-optimally. We also show that our method and upper-bounds can be extended for two-layer ReLU neural networks. Numerical simulations validate our theoretical claims.
    Fast Sparse Classification for Generalized Linear and Additive Models. (arXiv:2202.11389v1 [cs.LG])
    We present fast classification techniques for sparse generalized linear and additive models. These techniques can handle thousands of features and thousands of observations in minutes, even in the presence of many highly correlated features. For fast sparse logistic regression, our computational speed-up over other best-subset search techniques owes to linear and quadratic surrogate cuts for the logistic loss that allow us to efficiently screen features for elimination, as well as use of a priority queue that favors a more uniform exploration of features. As an alternative to the logistic loss, we propose the exponential loss, which permits an analytical solution to the line search at each iteration. Our algorithms are generally 2 to 5 times faster than previous approaches. They produce interpretable models that have accuracy comparable to black box models on challenging datasets.
    A Differential Attention Fusion Model Based on Transformer for Time Series Forecasting. (arXiv:2202.11402v1 [cs.LG])
    Time series forecasting is widely used in the fields of equipment life cycle forecasting, weather forecasting, traffic flow forecasting, and other fields. Recently, some scholars have tried to apply Transformer to time series forecasting because of its powerful parallel training ability. However, the existing Transformer methods do not pay enough attention to the small time segments that play a decisive role in prediction, making it insensitive to small changes that affect the trend of time series, and it is difficult to effectively learn continuous time-dependent features. To solve this problem, we propose a differential attention fusion model based on Transformer, which designs the differential layer, neighbor attention, sliding fusion mechanism, and residual layer on the basis of classical Transformer architecture. Specifically, the differences of adjacent time points are extracted and focused by difference and neighbor attention. The sliding fusion mechanism fuses various features of each time point so that the data can participate in encoding and decoding without losing important information. The residual layer including convolution and LSTM further learns the dependence between time points and enables our model to carry out deeper training. A large number of experiments on three datasets show that the prediction results produced by our method are favorably comparable to the state-of-the-art.
    FlowSense: Monitoring Airflow in Building Ventilation Systems Using Audio Sensing. (arXiv:2202.11136v1 [cs.SD])
    Proper indoor ventilation through buildings' heating, ventilation, and air conditioning (HVAC) systems has become an increasing public health concern that significantly impacts individuals' health and safety at home, work, and school. While much work has progressed in providing energy-efficient and user comfort for HVAC systems through IoT devices and mobile-sensing approaches, ventilation is an aspect that has received lesser attention despite its importance. With a motivation to monitor airflow from building ventilation systems through commodity sensing devices, we present FlowSense, a machine learning-based algorithm to predict airflow rate from sensed audio data in indoor spaces. Our ML technique can predict the state of an air vent-whether it is on or off-as well as the rate of air flowing through active vents. By exploiting a low-pass filter to obtain low-frequency audio signals, we put together a privacy-preserving pipeline that leverages a silence detection algorithm to only sense for sounds of air from HVAC air vent when no human speech is detected. We also propose the Minimum Persistent Sensing (MPS) as a post-processing algorithm to reduce interference from ambient noise, including ongoing human conversation, office machines, and traffic noises. Together, these techniques ensure user privacy and improve the robustness of FlowSense. We validate our approach yielding over 90% accuracy in predicting vent status and 0.96 MSE in predicting airflow rate when the device is placed within 2.25 meters away from an air vent. Additionally, we demonstrate how our approach as a mobile audio-sensing platform is robust to smartphone models, distance, and orientation. Finally, we evaluate FlowSense privacy-preserving pipeline through a user study and a Google Speech Recognition service, confirming that the audio signals we used as input data are inaudible and inconstructible.  ( 2 min )
    Privacy issues on biometric systems. (arXiv:2202.11415v1 [cs.CR])
    In the XXIth century there is a strong interest on privacy issues. Technology permits obtaining personal information without individuals consent, computers make it feasible to share and process this information, and this can bring about damaging implications. In some sense, biometric information is personal information, so it is important to be conscious about what is true and what is false when some people claim that biometrics is an attempt to individuals privacy. In this paper, key points related to this matter are dealt with.
    ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints. (arXiv:2202.11271v1 [cs.RO])
    Robotic navigation has been approached as a problem of 3D reconstruction and planning, as well as an end-to-end learning problem. However, long-range navigation requires both planning and reasoning about local traversability, as well as being able to utilize information about global geography, in the form of a roadmap, GPS, or other side information, which provides important navigational hints but may be low-fidelity or unreliable. In this work, we propose a learning-based approach that integrates learning and planning, and can utilize side information such as schematic roadmaps, satellite maps and GPS coordinates as a planning heuristic, without relying on them being accurate. Our method, ViKiNG, incorporates a local traversability model, which looks at the robot's current camera observation and a potential subgoal to infer how easily that subgoal can be reached, as well as a heuristic model, which looks at overhead maps and attempts to estimate the distance to the destination for various subgoals. These models are used by a heuristic planner to decide the best next subgoal in order to reach the final destination. Our method performs no explicit geometric reconstruction, utilizing only a topological representation of the environment. Despite having never seen trajectories longer than 80 meters in its training dataset, ViKiNG can leverage its image-based learned controller and goal-directed heuristic to navigate to goals up to 3 kilometers away in previously unseen environments, and exhibit complex behaviors such as probing potential paths and doubling back when they are found to be non-viable. ViKiNG is also robust to unreliable maps and GPS, since the low-level controller ultimately makes decisions based on egocentric image observations, using maps only as planning heuristics. For videos of our experiments, please check out https://sites.google.com/view/viking-release.
    Learning Temporal Point Processes for Efficient Retrieval of Continuous Time Event Sequences. (arXiv:2202.11485v1 [cs.IR])
    Recent developments in predictive modeling using marked temporal point processes (MTPP) have enabled an accurate characterization of several real-world applications involving continuous-time event sequences (CTESs). However, the retrieval problem of such sequences remains largely unaddressed in literature. To tackle this, we propose NEUROSEQRET which learns to retrieve and rank a relevant set of continuous-time event sequences for a given query sequence, from a large corpus of sequences. More specifically, NEUROSEQRET first applies a trainable unwarping function on the query sequence, which makes it comparable with corpus sequences, especially when a relevant query-corpus pair has individually different attributes. Next, it feeds the unwarped query sequence and the corpus sequence into MTPP guided neural relevance models. We develop two variants of the relevance model which offer a tradeoff between accuracy and efficiency. We also propose an optimization framework to learn binary sequence embeddings from the relevance scores, suitable for the locality-sensitive hashing leading to a significant speedup in returning top-K results for a given query sequence. Our experiments with several datasets show the significant accuracy boost of NEUROSEQRET beyond several baselines, as well as the efficacy of our hashing mechanism.
    Bag Graph: Multiple Instance Learning using Bayesian Graph Neural Networks. (arXiv:2202.11132v1 [cs.LG])
    Multiple Instance Learning (MIL) is a weakly supervised learning problem where the aim is to assign labels to sets or bags of instances, as opposed to traditional supervised learning where each instance is assumed to be independent and identically distributed (IID) and is to be labeled individually. Recent work has shown promising results for neural network models in the MIL setting. Instead of focusing on each instance, these models are trained in an end-to-end fashion to learn effective bag-level representations by suitably combining permutation invariant pooling techniques with neural architectures. In this paper, we consider modelling the interactions between bags using a graph and employ Graph Neural Networks (GNNs) to facilitate end-to-end learning. Since a meaningful graph representing dependencies between bags is rarely available, we propose to use a Bayesian GNN framework that can generate a likely graph structure for scenarios where there is uncertainty in the graph or when no graph is available. Empirical results demonstrate the efficacy of the proposed technique for several MIL benchmark tasks and a distribution regression task.
    Continual Auxiliary Task Learning. (arXiv:2202.11133v1 [cs.LG])
    Learning auxiliary tasks, such as multiple predictions about the world, can provide many benefits to reinforcement learning systems. A variety of off-policy learning algorithms have been developed to learn such predictions, but as yet there is little work on how to adapt the behavior to gather useful data for those off-policy predictions. In this work, we investigate a reinforcement learning system designed to learn a collection of auxiliary tasks, with a behavior policy learning to take actions to improve those auxiliary predictions. We highlight the inherent non-stationarity in this continual auxiliary task learning problem, for both prediction learners and the behavior learner. We develop an algorithm based on successor features that facilitates tracking under non-stationary rewards, and prove the separation into learning successor features and rewards provides convergence rate improvements. We conduct an in-depth study into the resulting multi-prediction learning system.
    Multi-fidelity reinforcement learning framework for shape optimization. (arXiv:2202.11170v1 [cs.LG])
    Deep reinforcement learning (DRL) is a promising outer-loop intelligence paradigm which can deploy problem solving strategies for complex tasks. Consequently, DRL has been utilized for several scientific applications, specifically in cases where classical optimization or control methods are limited. One key limitation of conventional DRL methods is their episode-hungry nature which proves to be a bottleneck for tasks which involve costly evaluations of a numerical forward model. In this article, we address this limitation of DRL by introducing a controlled transfer learning framework that leverages a multi-fidelity simulation setting. Our strategy is deployed for an airfoil shape optimization problem at high Reynolds numbers, where our framework can learn an optimal policy for generating efficient airfoil shapes by gathering knowledge from multi-fidelity environments and reduces computational costs by over 30\%. Furthermore, our formulation promotes policy exploration and generalization to new environments, thereby preventing over-fitting to data from solely one fidelity. Our results demonstrate this framework's applicability to other scientific DRL scenarios where multi-fidelity environments can be used for policy learning.
    r-G2P: Evaluating and Enhancing Robustness of Grapheme to Phoneme Conversion by Controlled noise introducing and Contextual information incorporation. (arXiv:2202.11194v1 [eess.AS])
    Grapheme-to-phoneme (G2P) conversion is the process of converting the written form of words to their pronunciations. It has an important role for text-to-speech (TTS) synthesis and automatic speech recognition (ASR) systems. In this paper, we aim to evaluate and enhance the robustness of G2P models. We show that neural G2P models are extremely sensitive to orthographical variations in graphemes like spelling mistakes. To solve this problem, we propose three controlled noise introducing methods to synthesize noisy training data. Moreover, we incorporate the contextual information with the baseline and propose a robust training strategy to stabilize the training process. The experimental results demonstrate that our proposed robust G2P model (r-G2P) outperforms the baseline significantly (-2.73\% WER on Dict-based benchmarks and -9.09\% WER on Real-world sources).
    ProtoSound: A Personalized and Scalable Sound Recognition System for Deaf and Hard-of-Hearing Users. (arXiv:2202.11134v1 [cs.HC])
    Recent advances have enabled automatic sound recognition systems for deaf and hard of hearing (DHH) users on mobile devices. However, these tools use pre-trained, generic sound recognition models, which do not meet the diverse needs of DHH users. We introduce ProtoSound, an interactive system for customizing sound recognition models by recording a few examples, thereby enabling personalized and fine-grained categories. ProtoSound is motivated by prior work examining sound awareness needs of DHH people and by a survey we conducted with 472 DHH participants. To evaluate ProtoSound, we characterized performance on two real-world sound datasets, showing significant improvement over state-of-the-art (e.g., +9.7% accuracy on the first dataset). We then deployed ProtoSound's end-user training and real-time recognition through a mobile application and recruited 19 hearing participants who listened to the real-world sounds and rated the accuracy across 56 locations (e.g., homes, restaurants, parks). Results show that ProtoSound personalized the model on-device in real-time and accurately learned sounds across diverse acoustic contexts. We close by discussing open challenges in personalizable sound recognition, including the need for better recording interfaces and algorithmic improvements.
    Designing Decision Support Systems for Emergency Response: Challenges and Opportunities. (arXiv:2202.11268v1 [cs.AI])
    Designing effective emergency response management (ERM) systems to respond to incidents such as road accidents is a major problem faced by communities. In addition to responding to frequent incidents each day (about 240 million emergency medical services calls and over 5 million road accidents in the US each year), these systems also support response during natural hazards. Recently, there has been a consistent interest in building decision support and optimization tools that can help emergency responders provide more efficient and effective response. This includes a number of principled subsystems that implement early incident detection, incident likelihood forecasting and strategic resource allocation and dispatch policies. In this paper, we highlight the key challenges and provide an overview of the approach developed by our team in collaboration with our community partners.
    Robust Geometric Metric Learning. (arXiv:2202.11550v1 [stat.ML])
    This paper proposes new algorithms for the metric learning problem. We start by noticing that several classical metric learning formulations from the literature can be viewed as modified covariance matrix estimation problems. Leveraging this point of view, a general approach, called Robust Geometric Metric Learning (RGML), is then studied. This method aims at simultaneously estimating the covariance matrix of each class while shrinking them towards their (unknown) barycenter. We focus on two specific costs functions: one associated with the Gaussian likelihood (RGML Gaussian), and one with Tyler's M -estimator (RGML Tyler). In both, the barycenter is defined with the Riemannian distance, which enjoys nice properties of geodesic convexity and affine invariance. The optimization is performed using the Riemannian geometry of symmetric positive definite matrices and its submanifold of unit determinant. Finally, the performance of RGML is asserted on real datasets. Strong performance is exhibited while being robust to mislabeled data.
    Shisha: Online scheduling of CNN pipelines on heterogeneous architectures. (arXiv:2202.11575v1 [cs.PF])
    Chiplets have become a common methodology in modern chip design. Chiplets improve yield and enable heterogeneity at the level of cores, memory subsystem and the interconnect. Convolutional Neural Networks (CNNs) have high computational, bandwidth and memory capacity requirements owing to the increasingly large amount of weights. Thus to exploit chiplet-based architectures, CNNs must be optimized in terms of scheduling and workload distribution among computing resources. We propose Shisha, an online approach to generate and schedule parallel CNN pipelines on chiplet architectures. Shisha targets heterogeneity in compute performance and memory bandwidth and tunes the pipeline schedule through a fast online exploration technique. We compare Shisha with Simulated Annealing, Hill Climbing and Pipe-Search. On average, the convergence time is improved by ~35x in Shisha compared to other exploration algorithms. Despite the quick exploration, Shisha's solution is often better than that of other heuristic exploration algorithms.
    Curiosity-Driven Exploration via Latent Bayesian Surprise. (arXiv:2104.07495v2 [cs.LG] UPDATED)
    The human intrinsic desire to pursue knowledge, also known as curiosity, is considered essential in the process of skill acquisition. With the aid of artificial curiosity, we could equip current techniques for control, such as Reinforcement Learning, with more natural exploration capabilities. A promising approach in this respect has consisted of using Bayesian surprise on model parameters, i.e. a metric for the difference between prior and posterior beliefs, to favour exploration. In this contribution, we propose to apply Bayesian surprise in a latent space representing the agent's current understanding of the dynamics of the system, drastically reducing the computational costs. We extensively evaluate our method by measuring the agent's performance in terms of environment exploration, for continuous tasks, and looking at the game scores achieved, for video games. Our model is computationally cheap and compares positively with current state-of-the-art methods on several problems. We also investigate the effects caused by stochasticity in the environment, which is often a failure case for curiosity-driven agents. In this regime, the results suggest that our approach is resilient to stochastic transitions.
    Absolute Zero-Shot Learning. (arXiv:2202.11319v1 [cs.CV])
    Considering the increasing concerns about data copyright and privacy issues, we present a novel Absolute Zero-Shot Learning (AZSL) paradigm, i.e., training a classifier with zero real data. The key innovation is to involve a teacher model as the data safeguard to guide the AZSL model training without data leaking. The AZSL model consists of a generator and student network, which can achieve date-free knowledge transfer while maintaining the performance of the teacher network. We investigate `black-box' and `white-box' scenarios in AZSL task as different levels of model security. Besides, we also provide discussion of teacher model in both inductive and transductive settings. Despite embarrassingly simple implementations and data-missing disadvantages, our AZSL framework can retain state-of-the-art ZSL and GZSL performance under the `white-box' scenario. Extensive qualitative and quantitative analysis also demonstrates promising results when deploying the model under `black-box' scenario.
    Mirror Descent Strikes Again: Optimal Stochastic Convex Optimization under Infinite Noise Variance. (arXiv:2202.11632v1 [stat.ML])
    We study stochastic convex optimization under infinite noise variance. Specifically, when the stochastic gradient is unbiased and has uniformly bounded $(1+\kappa)$-th moment, for some $\kappa \in (0,1]$, we quantify the convergence rate of the Stochastic Mirror Descent algorithm with a particular class of uniformly convex mirror maps, in terms of the number of iterations, dimensionality and related geometric parameters of the optimization problem. Interestingly this algorithm does not require any explicit gradient clipping or normalization, which have been extensively used in several recent empirical and theoretical works. We complement our convergence results with information-theoretic lower bounds showing that no other algorithm using only stochastic first-order oracles can achieve improved rates. Our results have several interesting consequences for devising online/streaming stochastic approximation algorithms for problems arising in robust statistics and machine learning.
    FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators. (arXiv:2202.11214v1 [physics.ao-ph])
    FourCastNet, short for Fourier Forecasting Neural Network, is a global data-driven weather forecasting model that provides accurate short to medium-range global predictions at $0.25^{\circ}$ resolution. FourCastNet accurately forecasts high-resolution, fast-timescale variables such as the surface wind speed, precipitation, and atmospheric water vapor. It has important implications for planning wind energy resources, predicting extreme weather events such as tropical cyclones, extra-tropical cyclones, and atmospheric rivers. FourCastNet matches the forecasting accuracy of the ECMWF Integrated Forecasting System (IFS), a state-of-the-art Numerical Weather Prediction (NWP) model, at short lead times for large-scale variables, while outperforming IFS for variables with complex fine-scale structure, including precipitation. FourCastNet generates a week-long forecast in less than 2 seconds, orders of magnitude faster than IFS. The speed of FourCastNet enables the creation of rapid and inexpensive large-ensemble forecasts with thousands of ensemble-members for improving probabilistic forecasting. We discuss how data-driven deep learning models such as FourCastNet are a valuable addition to the meteorology toolkit to aid and augment NWP models.
    Deep Graph Learning for Anomalous Citation Detection. (arXiv:2202.11360v1 [cs.LG])
    Anomaly detection is one of the most active research areas in various critical domains, such as healthcare, fintech, and public security. However, little attention has been paid to scholarly data, i.e., anomaly detection in a citation network. Citation is considered as one of the most crucial metrics to evaluate the impact of scientific research, which may be gamed in multiple ways. Therefore, anomaly detection in citation networks is of significant importance to identify manipulation and inflation of citations. To address this open issue, we propose a novel deep graph learning model, namely GLAD (Graph Learning for Anomaly Detection), to identify anomalies in citation networks. GLAD incorporates text semantic mining to network representation learning by adding both node attributes and link attributes via graph neural networks. It exploits not only the relevance of citation contents but also hidden relationships between papers. Within the GLAD framework, we propose an algorithm called CPU (Citation PUrpose) to discover the purpose of citation based on citation texts. The performance of GLAD is validated through a simulated anomalous citation dataset. Experimental results demonstrate the effectiveness of GLAD on the anomalous citation detection task.
    Super-resolution GANs of randomly-seeded fields. (arXiv:2202.11701v1 [physics.flu-dyn])
    Reconstruction of field quantities from sparse measurements is a problem arising in a broad spectrum of applications. This task is particularly challenging when mapping between point sparse measurements and field quantities shall be performed in an unsupervised manner. Further complexity is added for moving sensors and/or random on-off status. Under such conditions, the most straightforward solution is to interpolate the scattered data onto a regular grid. However, the spatial resolution achieved with this approach is ultimately limited by the mean spacing between the sparse measurements. In this work, we propose a novel super-resolution generative adversarial network (GAN) framework to estimate field quantities from random sparse sensors without needing any full-resolution field for training. The algorithm exploits random sampling to provide incomplete views of the high-resolution underlying distributions. It is hereby referred to as RAndomly-SEEDed super-resolution GAN (RaSeedGAN). The proposed technique is tested on synthetic databases of fluid flow simulations, ocean surface temperature distributions measurements, and particle image velocimetry data of a zero-pressure-gradient turbulent boundary layer. The results show an excellent performance of the proposed methodology even in cases with a high level of gappyness (>50\%) or noise conditions. To our knowledge, this is the first super-resolution GANs algorithm for full-field estimation from randomly-seeded fields with no need of a full-field high-resolution representation during training nor of a library of training examples.  ( 2 min )
    Real-time Over-the-air Adversarial Perturbations for Digital Communications using Deep Neural Networks. (arXiv:2202.11197v1 [cs.CR])
    Deep neural networks (DNNs) are increasingly being used in a variety of traditional radiofrequency (RF) problems. Previous work has shown that while DNN classifiers are typically more accurate than traditional signal processing algorithms, they are vulnerable to intentionally crafted adversarial perturbations which can deceive the DNN classifiers and significantly reduce their accuracy. Such intentional adversarial perturbations can be used by RF communications systems to avoid reactive-jammers and interception systems which rely on DNN classifiers to identify their target modulation scheme. While previous research on RF adversarial perturbations has established the theoretical feasibility of such attacks using simulation studies, critical questions concerning real-world implementation and viability remain unanswered. This work attempts to bridge this gap by defining class-specific and sample-independent adversarial perturbations which are shown to be effective yet computationally feasible in real-time and time-invariant. We demonstrate the effectiveness of these attacks over-the-air across a physical channel using software-defined radios (SDRs). Finally, we demonstrate that these adversarial perturbations can be emitted from a source other than the communications device, making these attacks practical for devices that cannot manipulate their transmitted signals at the physical layer.
    Differentiable and Learnable Robot Models. (arXiv:2202.11217v1 [cs.RO])
    Building differentiable simulations of physical processes has recently received an increasing amount of attention. Specifically, some efforts develop differentiable robotic physics engines motivated by the computational benefits of merging rigid body simulations with modern differentiable machine learning libraries. Here, we present a library that focuses on the ability to combine data driven methods with analytical rigid body computations. More concretely, our library \emph{Differentiable Robot Models} implements both \emph{differentiable} and \emph{learnable} models of the kinematics and dynamics of robots in Pytorch. The source-code is available at \url{https://github.com/facebookresearch/differentiable-robot-model}
    Learning Neural Networks under Input-Output Specifications. (arXiv:2202.11246v1 [eess.SY])
    In this paper, we examine an important problem of learning neural networks that certifiably meet certain specifications on input-output behaviors. Our strategy is to find an inner approximation of the set of admissible policy parameters, which is convex in a transformed space. To this end, we address the key technical challenge of convexifying the verification condition for neural networks, which is derived by abstracting the nonlinear specifications and activation functions with quadratic constraints. In particular, we propose a reparametrization scheme of the original neural network based on loop transformation, which leads to a convex condition that can be enforced during learning. This theoretical construction is validated in an experiment that specifies reachable sets for different regions of inputs.
    Exploring Classic Quantitative Strategies. (arXiv:2202.11309v1 [q-fin.GN])
    The goal of this paper is to debunk and dispel the magic behind the black-box quantitative strategies. It aims to build a solid foundation on how and why the techniques work. This manuscript crystallizes this knowledge by deriving from simple intuitions, the mathematics behind the strategies. This tutorial doesn't shy away from addressing both the formal and informal aspects of quantitative strategies. By doing so, it hopes to provide readers with a deeper understanding of these techniques as well as the when, the how and the why of applying these techniques. The strategies are presented in terms of both S\&P500 and SH510300 data sets. However, the results from the tests are just examples of how the methods work; no claim is made on the suggestion of real market positions.
    Preformer: Predictive Transformer with Multi-Scale Segment-wise Correlations for Long-Term Time Series Forecasting. (arXiv:2202.11356v1 [cs.LG])
    Transformer-based methods have shown great potential in long-term time series forecasting. However, most of these methods adopt the standard point-wise self-attention mechanism, which not only becomes intractable for long-term forecasting since its complexity increases quadratically with the length of time series, but also cannot explicitly capture the predictive dependencies from contexts since the corresponding key and value are transformed from the same point. This paper proposes a predictive Transformer-based model called {\em Preformer}. Preformer introduces a novel efficient {\em Multi-Scale Segment-Correlation} mechanism that divides time series into segments and utilizes segment-wise correlation-based attention for encoding time series. A multi-scale structure is developed to aggregate dependencies at different temporal scales and facilitate the selection of segment length. Preformer further designs a predictive paradigm for decoding, where the key and value come from two successive segments rather than the same segment. In this way, if a key segment has a high correlation score with the query segment, its successive segment contributes more to the prediction of the query segment. Extensive experiments demonstrate that our Preformer outperforms other Transformer-based methods.
    Performance Modeling of Metric-Based Serverless Computing Platforms. (arXiv:2202.11247v1 [cs.DC])
    Analytical performance models are very effective in ensuring the quality of service and cost of service deployment remain desirable under different conditions and workloads. While various analytical performance models have been proposed for previous paradigms in cloud computing, serverless computing lacks such models that can provide developers with performance guarantees. Besides, most serverless computing platforms still require developers' input to specify the configuration for their deployment that could affect both the performance and cost of their deployment, without providing them with any direct and immediate feedback. In previous studies, we built such performance models for steady-state and transient analysis of scale-per-request serverless computing platforms (e.g., AWS Lambda, Azure Functions, Google Cloud Functions) that could give developers immediate feedback about the quality of service and cost of their deployments. In this work, we aim to develop analytical performance models for the latest trend in serverless computing platforms that use concurrency value and the rate of requests per second for autoscaling decisions. Examples of such serverless computing platforms are Knative and Google Cloud Run (a managed Knative service by Google). The proposed performance model can help developers and providers predict the performance and cost of deployments with different configurations which could help them tune the configuration toward the best outcome. We validate the applicability and accuracy of the proposed performance model by extensive real-world experimentation on Knative and show that our performance model is able to accurately predict the steady-state characteristics of a given workload with minimal amount of data collection.
    Web of Scholars: A Scholar Knowledge Graph. (arXiv:2202.11311v1 [cs.DL])
    In this work, we demonstrate a novel system, namely Web of Scholars, which integrates state-of-the-art mining techniques to search, mine, and visualize complex networks behind scholars in the field of Computer Science. Relying on the knowledge graph, it provides services for fast, accurate, and intelligent semantic querying as well as powerful recommendations. In addition, in order to realize information sharing, it provides an open API to be served as the underlying architecture for advanced functions. Web of Scholars takes advantage of knowledge graph, which means that it will be able to access more knowledge if more search exist. It can be served as a useful and interoperable tool for scholars to conduct in-depth analysis within Science of Science.
    Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet. (arXiv:2202.11169v1 [eess.AS])
    Neural speech synthesis models can synthesize high quality speech but typically require a high computational complexity to do so. In previous work, we introduced LPCNet, which uses linear prediction to significantly reduce the complexity of neural synthesis. In this work, we further improve the efficiency of LPCNet -- targeting both algorithmic and computational improvements -- to make it usable on a wide variety of devices. We demonstrate an improvement in synthesis quality while operating 2.5x faster. The resulting open-source LPCNet algorithm can perform real-time neural synthesis on most existing phones and is even usable in some embedded devices.
    Study of Feature Importance for Quantum Machine Learning Models. (arXiv:2202.11204v1 [quant-ph])
    Predictor importance is a crucial part of data preprocessing pipelines in classical and quantum machine learning (QML). This work presents the first study of its kind in which feature importance for QML models has been explored and contrasted against their classical machine learning (CML) equivalents. We developed a hybrid quantum-classical architecture where QML models are trained and feature importance values are calculated from classical algorithms on a real-world dataset. This architecture has been implemented on ESPN Fantasy Football data using Qiskit statevector simulators and IBM quantum hardware such as the IBMQ Mumbai and IBMQ Montreal systems. Even though we are in the Noisy Intermediate-Scale Quantum (NISQ) era, the physical quantum computing results are promising. To facilitate current quantum scale, we created a data tiering, model aggregation, and novel validation methods. Notably, the feature importance magnitudes from the quantum models had a much higher variation when contrasted to classical models. We can show that equivalent QML and CML models are complementary through diversity measurements. The diversity between QML and CML demonstrates that both approaches can contribute to a solution in different ways. Within this paper we focus on Quantum Support Vector Classifiers (QSVC), Variational Quantum Circuit (VQC), and their classical counterparts. The ESPN and IBM fantasy footballs Trade Assistant combines advanced statistical analysis with the natural language processing of Watson Discovery to serve up personalized trade recommendations that are fair and proposes a trade. Here, player valuation data of each player has been considered and this work can be extended to calculate the feature importance of other QML models such as Quantum Boltzmann machines.
    No-Regret Learning with Unbounded Losses: The Case of Logarithmic Pooling. (arXiv:2202.11219v1 [cs.LG])
    For each of $T$ time steps, $m$ experts report probability distributions over $n$ outcomes; we wish to learn to aggregate these forecasts in a way that attains a no-regret guarantee. We focus on the fundamental and practical aggregation method known as logarithmic pooling -- a weighted average of log odds -- which is in a certain sense the optimal choice of pooling method if one is interested in minimizing log loss (as we take to be our loss function). We consider the problem of learning the best set of parameters (i.e. expert weights) in an online adversarial setting. We assume (by necessity) that the adversarial choices of outcomes and forecasts are consistent, in the sense that experts report calibrated forecasts. Our main result is an algorithm based on online mirror descent that learns expert weights in a way that attains $O(\sqrt{T} \log T)$ expected regret as compared with the best weights in hindsight.
    Exploring Edge Disentanglement for Node Classification. (arXiv:2202.11245v1 [cs.SI])
    Edges in real-world graphs are typically formed by a variety of factors and carry diverse relation semantics. For example, connections in a social network could indicate friendship, being colleagues, or living in the same neighborhood. However, these latent factors are usually concealed behind mere edge existence due to the data collection and graph formation processes. Despite rapid developments in graph learning over these years, most models take a holistic approach and treat all edges as equal. One major difficulty in disentangling edges is the lack of explicit supervisions. In this work, with close examination of edge patterns, we propose three heuristics and design three corresponding pretext tasks to guide the automatic edge disentanglement. Concretely, these self-supervision tasks are enforced on a designed edge disentanglement module to be trained jointly with the downstream node classification task to encourage automatic edge disentanglement. Channels of the disentanglement module are expected to capture distinguishable relations and neighborhood interactions, and outputs from them are aggregated as node representations. The proposed DisGNN is easy to be incorporated with various neural architectures, and we conduct experiments on $6$ real-world datasets. Empirical results show that it can achieve significant performance gains.
    A New Generation of Perspective API: Efficient Multilingual Character-level Transformers. (arXiv:2202.11176v1 [cs.CL])
    On the world wide web, toxic content detectors are a crucial line of defense against potentially hateful and offensive messages. As such, building highly effective classifiers that enable a safer internet is an important research area. Moreover, the web is a highly multilingual, cross-cultural community that develops its own lingo over time. As such, it is crucial to develop models that are effective across a diverse range of languages, usages, and styles. In this paper, we present the fundamentals behind the next version of the Perspective API from Google Jigsaw. At the heart of the approach is a single multilingual token-free Charformer model that is applicable across a range of languages, domains, and tasks. We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings. We additionally outline the techniques employed to make such a byte-level model efficient and feasible for productionization. Through extensive experiments on multilingual toxic comment classification benchmarks derived from real API traffic and evaluation on an array of code-switching, covert toxicity, emoji-based hate, human-readable obfuscation, distribution shift, and bias evaluation settings, we show that our proposed approach outperforms strong baselines. Finally, we present our findings from deploying this system in production.
    Continual learning-based probabilistic slow feature analysis for multimode dynamic process monitoring. (arXiv:2202.11295v1 [cs.LG])
    In this paper, a novel multimode dynamic process monitoring approach is proposed by extending elastic weight consolidation (EWC) to probabilistic slow feature analysis (PSFA) in order to extract multimode slow features for online monitoring. EWC was originally introduced in the setting of machine learning of sequential multi-tasks with the aim of avoiding catastrophic forgetting issue, which equally poses as a major challenge in multimode dynamic process monitoring. When a new mode arrives, a set of data should be collected so that this mode can be identified by PSFA and prior knowledge. Then, a regularization term is introduced to prevent new data from significantly interfering with the learned knowledge, where the parameter importance measures are estimated. The proposed method is denoted as PSFA-EWC, which is updated continually and capable of achieving excellent performance for successive modes. Different from traditional multimode monitoring algorithms, PSFA-EWC furnishes backward and forward transfer ability. The significant features of previous modes are retained while consolidating new information, which may contribute to learning new relevant modes. Compared with several known methods, the effectiveness of the proposed method is demonstrated via a continuous stirred tank heater and a practical coal pulverizing system.
    Bitwidth Heterogeneous Federated Learning with Progressive Weight Dequantization. (arXiv:2202.11453v1 [cs.LG])
    In practical federated learning scenarios, the participating devices may have different bitwidths for computation and memory storage by design. However, despite the progress made in device-heterogeneous federated learning scenarios, the heterogeneity in the bitwidth specifications in the hardware has been mostly overlooked. We introduce a pragmatic FL scenario with bitwidth heterogeneity across the participating devices, dubbed as Bitwidth Heterogeneous Federated Learning (BHFL). BHFL brings in a new challenge, that the aggregation of model parameters with different bitwidths could result in severe performance degeneration, especially for high-bitwidth models. To tackle this problem, we propose ProWD framework, which has a trainable weight dequantizer at the central server that progressively reconstructs the low-bitwidth weights into higher bitwidth weights, and finally into full-precision weights. ProWD further selectively aggregates the model parameters to maximize the compatibility across bit-heterogeneous weights. We validate ProWD against relevant FL baselines on the benchmark datasets, using clients with varying bitwidths. Our ProWD largely outperforms the baseline FL algorithms as well as naive approaches (e.g. grouped averaging) under the proposed BHFL scenario.
    A new LDA formulation with covariates. (arXiv:2202.11527v1 [cs.IR])
    The Latent Dirichlet Allocation (LDA) model is a popular method for creating mixed-membership clusters. Despite having been originally developed for text analysis, LDA has been used for a wide range of other applications. We propose a new formulation for the LDA model which incorporates covariates. In this model, a negative binomial regression is embedded within LDA, enabling straight-forward interpretation of the regression coefficients and the analysis of the quantity of cluster-specific elements in each sampling units (instead of the analysis being focused on modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate model parameters. We rely on simulations to show how our algorithm is able to successfully retrieve the true parameter values and the ability to make predictions for the abundance matrix using the information given by the covariates. The model is illustrated using real data sets from three different areas: text-mining of Coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters.
    Deepfake Detection for Facial Images with Facemasks. (arXiv:2202.11359v1 [cs.CV])
    Hyper-realistic face image generation and manipulation have givenrise to numerous unethical social issues, e.g., invasion of privacy,threat of security, and malicious political maneuvering, which re-sulted in the development of recent deepfake detection methodswith the rising demands of deepfake forensics. Proposed deepfakedetection methods to date have shown remarkable detection perfor-mance and robustness. However, none of the suggested deepfakedetection methods assessed the performance of deepfakes withthe facemask during the pandemic crisis after the outbreak of theCovid-19. In this paper, we thoroughly evaluate the performance ofstate-of-the-art deepfake detection models on the deepfakes withthe facemask. Also, we propose two approaches to enhance themasked deepfakes detection:face-patchandface-crop. The experi-mental evaluations on both methods are assessed through the base-line deepfake detection models on the various deepfake datasets.Our extensive experiments show that, among the two methods,face-cropperforms better than theface-patch, and could be a trainmethod for deepfake detection models to detect fake faces withfacemask in real world.
    Model2Detector: Widening the Information Bottleneck for Out-of-Distribution Detection using a Handful of Gradient Steps. (arXiv:2202.11226v1 [cs.LG])
    Out-of-distribution detection is an important capability that has long eluded vanilla neural networks. Deep Neural networks (DNNs) tend to generate over-confident predictions when presented with inputs that are significantly out-of-distribution (OOD). This can be dangerous when employing machine learning systems in the wild as detecting attacks can thus be difficult. Recent advances inference-time out-of-distribution detection help mitigate some of these problems. However, existing methods can be restrictive as they are often computationally expensive. Additionally, these methods require training of a downstream detector model which learns to detect OOD inputs from in-distribution ones. This, therefore, adds latency during inference. Here, we offer an information theoretic perspective on why neural networks are inherently incapable of OOD detection. We attempt to mitigate these flaws by converting a trained model into a an OOD detector using a handful of steps of gradient descent. Our work can be employed as a post-processing method whereby an inference-time ML system can convert a trained model into an OOD detector. Experimentally, we show how our method consistently outperforms the state-of-the-art in detection accuracy on popular image datasets while also reducing computational complexity.
    Globally Convergent Policy Search over Dynamic Filters for Output Estimation. (arXiv:2202.11659v1 [math.OC])
    We introduce the first direct policy search algorithm which provably converges to the globally optimal $\textit{dynamic}$ filter for the classical problem of predicting the outputs of a linear dynamical system, given noisy, partial observations. Despite the ubiquity of partial observability in practice, theoretical guarantees for direct policy search algorithms, one of the backbones of modern reinforcement learning, have proven difficult to achieve. This is primarily due to the degeneracies which arise when optimizing over filters that maintain internal state. In this paper, we provide a new perspective on this challenging problem based on the notion of $\textit{informativity}$, which intuitively requires that all components of a filter's internal state are representative of the true state of the underlying dynamical system. We show that informativity overcomes the aforementioned degeneracy. Specifically, we propose a $\textit{regularizer}$ which explicitly enforces informativity, and establish that gradient descent on this regularized objective - combined with a ``reconditioning step'' - converges to the globally optimal cost a $\mathcal{O}(1/T)$. Our analysis relies on several new results which may be of independent interest, including a new framework for analyzing non-convex gradient descent via convex reformulation, and novel bounds on the solution to linear Lyapunov equations in terms of (our quantitative measure of) informativity.
    Biometric security technology. (arXiv:2202.11459v1 [cs.CR])
    This paper presents an overview of the main topics related to biometric security technology, with the main purpose to provide a primer on this subject. Biometrics can offer greater security and convenience than traditional methods for people recognition. Even if we do not want to replace a classic method (password or handheld token) by a biometric one, for sure, we are potential users of these systems, which will even be mandatory for new passport models. For this reason, to be familiarized with the possibilities of biometric security technology is useful.
    It's not Rocket Science : Interpreting Figurative Language in Narratives. (arXiv:2109.00087v2 [cs.CL] UPDATED)
    Figurative language is ubiquitous in English. Yet, the vast majority of NLP research focuses on literal language. Existing text representations by design rely on compositionality, while figurative language is often non-compositional. In this paper, we study the interpretation of two non-compositional figurative languages (idioms and similes). We collected datasets of fictional narratives containing a figurative expression along with crowd-sourced plausible and implausible continuations relying on the correct interpretation of the expression. We then trained models to choose or generate the plausible continuation. Our experiments show that models based solely on pre-trained language models perform substantially worse than humans on these tasks. We additionally propose knowledge-enhanced models, adopting human strategies for interpreting figurative language types : inferring meaning from the context and relying on the constituent words' literal meanings. The knowledge-enhanced models improve the performance on both the discriminative and generative tasks, further bridging the gap from human performance.  ( 2 min )
    Constant matters: Fine-grained Complexity of Differentially Private Continual Observation Using Completely Bounded Norms. (arXiv:2202.11205v1 [cs.DS])
    We study fine-grained error bounds for differentially private algorithms for averaging and counting in the continual observation model. For this, we use the completely bounded spectral norm (cb norm) from operator algebra. For a matrix $W$, its cb norm is defined as \[ \|{W}\|_{\mathsf{cb}} = \max_{Q} \left\{ \frac{\|{Q \bullet W}\|}{\|{Q}\|} \right\}, \] where $Q \bullet W$ denotes the Schur product and $\|{\cdot}\|$ denotes the spectral norm. We bound the cb norm of two fundamental matrices studied in differential privacy under the continual observation model: the counting matrix $M_{\mathsf{counting}}$ and the averaging matrix $M_{\mathsf{average}}$. For $M_{\mathsf{counting}}$, we give lower and upper bound whose additive gap is $1 + \frac{1}{\pi}$. Our factorization also has two desirable properties sufficient for streaming setting: the factorization contains of lower-triangular matrices and the number of distinct entries in the factorization is exactly $T$. This allows us to compute the factorization on the fly while requiring the curator to store a $T$-dimensional vector. For $M_{\mathsf{average}}$, we show an additive gap between the lower and upper bound of $\approx 0.64$.
    Functional Parcellation of fMRI data using multistage k-means clustering. (arXiv:2202.11206v1 [eess.IV])
    Purpose: Functional Magnetic Resonance Imaging (fMRI) data acquired through resting-state studies have been used to obtain information about the spontaneous activations inside the brain. One of the approaches for analysis and interpretation of resting-state fMRI data require spatially and functionally homogenous parcellation of the whole brain based on underlying temporal fluctuations. Clustering is often used to generate functional parcellation. However, major clustering algorithms, when used for fMRI data, have their limitations. Among commonly used parcellation schemes, a tradeoff exists between intra-cluster functional similarity and alignment with anatomical regions. Approach: In this work, we present a clustering algorithm for resting state and task fMRI data which is developed to obtain brain parcellations that show high structural and functional homogeneity. The clustering is performed by multistage binary k-means clustering algorithm designed specifically for the 4D fMRI data. The results from this multistage k-means algorithm show that by modifying and combining different algorithms, we can take advantage of the strengths of different techniques while overcoming their limitations. Results: The clustering output for resting state fMRI data using the multistage k-means approach is shown to be better than simple k-means or functional atlas in terms of spatial and functional homogeneity. The clusters also correspond to commonly identifiable brain networks. For task fMRI, the clustering output can identify primary and secondary activation regions and provide information about the varying hemodynamic response across different brain regions. Conclusion: The multistage k-means approach can provide functional parcellations of the brain using resting state fMRI data. The method is model-free and is data driven which can be applied to both resting state and task fMRI.
    BacHMMachine: An Interpretable and Scalable Model for Algorithmic Harmonization for Four-part Baroque Chorales. (arXiv:2109.07623v2 [cs.SD] UPDATED)
    Algorithmic harmonization - the automated harmonization of a musical piece given its melodic line - is a challenging problem that has garnered much interest from both music theorists and computer scientists. One genre of particular interest is the four-part Baroque chorales of J.S. Bach. Methods for algorithmic chorale harmonization typically adopt a black-box, "data-driven" approach: they do not explicitly integrate principles from music theory but rely on a complex learning model trained with a large amount of chorale data. We propose instead a new harmonization model, called BacHMMachine, which employs a "theory-driven" framework guided by music composition principles, along with a "data-driven" model for learning compositional features within this framework. As its name suggests, BacHMMachine uses a novel Hidden Markov Model based on key and chord transitions, providing a probabilistic framework for learning key modulations and chordal progressions from a given melodic line. This allows for the generation of creative, yet musically coherent chorale harmonizations; integrating compositional principles allows for a much simpler model that results in vast decreases in computational burden and greater interpretability compared to state-of-the-art algorithmic harmonization methods, at no penalty to quality of harmonization or musicality. We demonstrate this improvement via comprehensive experiments and Turing tests comparing BacHMMachine to existing methods.
    Evaluating Inexact Unlearning Requires Revisiting Forgetting. (arXiv:2201.06640v2 [cs.LG] UPDATED)
    Existing methods in inexact unlearning are evaluated by measuring indistinguishability from models retrained after removing the deletion set. We argue that achieving indistinguishability is unnecessary and its practical relaxations are insufficient. We formulate the goal of unlearning as forgetting all information specific to the deletion set while maintaining high utility and resource efficiency. We introduce a novel test for forgetting called Interclass Confusion (IC). Despite being a black-box test, IC can investigate whether information from the deletion set was erased until the early layers of the network. We analyze two aspects of forgetting: (i) memorization and (ii) property generalization. We empirically show that two simple unlearning methods, exact-unlearning and catastrophic-forgetting the final k layers of a network, outperforms prior unlearning methods when scaled to large deletion sets. Overall, we believe our formulation and the IC test will guide the design of better unlearning algorithms.
    Analysis of a Target-Based Actor-Critic Algorithm with Linear Function Approximation. (arXiv:2106.07472v2 [cs.LG] UPDATED)
    Actor-critic methods integrating target networks have exhibited a stupendous empirical success in deep reinforcement learning. However, a theoretical understanding of the use of target networks in actor-critic methods is largely missing in the literature. In this paper, we reduce this gap between theory and practice by proposing the first theoretical analysis of an online target-based actor-critic algorithm with linear function approximation in the discounted reward setting. Our algorithm uses three different timescales: one for the actor and two for the critic. Instead of using the standard single timescale temporal difference (TD) learning algorithm as a critic, we use a two timescales target-based version of TD learning closely inspired from practical actor-critic algorithms implementing target networks. First, we establish asymptotic convergence results for both the critic and the actor under Markovian sampling. Then, we provide a finite-time analysis showing the impact of incorporating a target network into actor-critic methods.  ( 2 min )
    Nonconvex Extension of Generalized Huber Loss for Robust Learning and Pseudo-Mode Statistics. (arXiv:2202.11141v1 [stat.ML])
    We propose an extended generalization of the pseudo Huber loss formulation. We show that using the log-exp transform together with the logistic function, we can create a loss which combines the desirable properties of the strictly convex losses with robust loss functions. With this formulation, we show that a linear convergence algorithm can be utilized to find a minimizer. We further discuss the creation of a quasi-convex composite loss and provide a derivative-free exponential convergence rate algorithm.
    A duality connecting neural network and cosmological dynamics. (arXiv:2202.11104v1 [gr-qc])
    We demonstrate that the dynamics of neural networks trained with gradient descent and the dynamics of scalar fields in a flat, vacuum energy dominated Universe are structurally profoundly related. This duality provides the framework for synergies between these systems, to understand and explain neural network dynamics and new ways of simulating and describing early Universe models. Working in the continuous-time limit of neural networks, we analytically match the dynamics of the mean background and the dynamics of small perturbations around the mean field, highlighting potential differences in separate limits. We perform empirical tests of this analytic description and quantitatively show the dependence of the effective field theory parameters on hyperparameters of the neural network. As a result of this duality, the cosmological constant is matched inversely to the learning rate in the gradient descent update.
    Deep Recurrent Modelling of Granger Causality with Latent Confounding. (arXiv:2202.11286v1 [cs.LG])
    Inferring causal relationships in observational time series data is an important task when interventions cannot be performed. Granger causality is a popular framework to infer potential causal mechanisms between different time series. The original definition of Granger causality is restricted to linear processes and leads to spurious conclusions in the presence of a latent confounder. In this work, we harness the expressive power of recurrent neural networks and propose a deep learning-based approach to model non-linear Granger causality by directly accounting for latent confounders. Our approach leverages multiple recurrent neural networks to parameterise predictive distributions and we propose the novel use of a dual-decoder setup to conduct the Granger tests. We demonstrate the model performance on non-linear stochastic time series for which the latent confounder influences the cause and effect with different time lags; results show the effectiveness of our model compared to existing benchmarks.
    Finding Safe Zones of policies Markov Decision Processes. (arXiv:2202.11593v1 [cs.LG])
    Given a policy, we define a SafeZone as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of the SafeZone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. SafeZones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal SafeZones, and show that in general the problem is computationally hard. For this reason we concentrate on computing approximate SafeZones. Our main result is a bi-criteria approximation algorithm which gives a factor of almost $2$ approximation for both the escape probability and SafeZone size, using a polynomial size sample complexity. We conclude the paper with an empirical evaluation of our algorithm.
    Quantum Heterogeneous Distributed Deep Learning Architectures: Models, Discussions, and Applications. (arXiv:2202.11200v1 [quant-ph])
    Deep learning (DL) has already become a state-of-the-art technology for various data processing tasks. However, data security and computational overload problems frequently occur due to their high data and computational power dependence. To solve this problem, quantum deep learning (QDL) and distributed deep learning (DDL) are emerging to complement existing DL methods by reducing computational overhead and strengthening data security. Furthermore, a quantum distributed deep learning (QDDL) technique that combines these advantages and maximizes them is in the spotlight. QDL takes computational gains by replacing deep learning computations on local devices and servers with quantum deep learning. On the other hand, besides the advantages of the existing distributed learning structure, it can increase data security by using a quantum secure communication protocol between the server and the client. Although many attempts have been made to confirm and demonstrate these various possibilities, QDDL research is still in its infancy. This paper discusses the model structure studied so far and its possibilities and limitations to introduce and promote these studies. It also discusses the areas of applied research so far and in the future and the possibilities of new methodologies.
    Margin-distancing for safe model explanation. (arXiv:2202.11266v1 [cs.LG])
    The growing use of machine learning models in consequential settings has highlighted an important and seemingly irreconcilable tension between transparency and vulnerability to gaming. While this has sparked sizable debate in legal literature, there has been comparatively less technical study of this contention. In this work, we propose a clean-cut formulation of this tension and a way to make the tradeoff between transparency and gaming. We identify the source of gaming as being points close to the \emph{decision boundary} of the model. And we initiate an investigation on how to provide example-based explanations that are expansive and yet consistent with a version space that is sufficiently uncertain with respect to the boundary points' labels. Finally, we furnish our theoretical results with empirical investigations of this tradeoff on real-world datasets.
    GIFT: Graph-guIded Feature Transfer for Cold-Start Video Click-Through Rate Prediction. (arXiv:2202.11525v1 [cs.IR])
    Short video has witnessed rapid growth in China and shows a promising market for promoting the sales of products in e-commerce platforms like Taobao. To ensure the freshness of the content, the platform needs to release a large number of new videos every day, which makes the conventional click-through rate (CTR) prediction model suffer from the severe item cold-start problem. In this paper, we propose GIFT, an efficient Graph-guIded Feature Transfer system, to fully take advantages of the rich information of warmed-up videos that related to the cold-start video. More specifically, we conduct feature transfer from warmed-up videos to those cold-start ones by involving the physical and semantic linkages into a heterogeneous graph. The former linkages consist of those explicit relationships (e.g., sharing the same category, under the same authorship etc.), while the latter measure the proximity of multimodal representations of two videos. In practice, the style, content, and even the recommendation pattern are pretty similar among those physically or semantically related videos. Besides, in order to provide the robust id representations and historical statistics obtained from warmed-up neighbors that cold-start videos covet most, we elaborately design the transfer function to make aware of different transferred features from different types of nodes and edges along the metapath on the graph. Extensive experiments on a large real-world dataset show that our GIFT system outperforms SOTA methods significantly and brings a 6.82% lift on click-through rate (CTR) in the homepage of Taobao App.
    Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning. (arXiv:2202.11566v1 [cs.LG])
    Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment. Directly applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions. Previous methods tackle such problem by penalizing the Q-values of OOD actions or constraining the trained policy to be close to the behavior policy. Nevertheless, such methods typically prevent the generalization of value functions beyond the offline data and also lack precise characterization of OOD data. In this paper, we propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints. Specifically, PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To tackle the extrapolating error, we further propose a novel OOD sampling method. We show that such OOD sampling and pessimistic bootstrapping yields provable uncertainty quantifier in linear MDPs, thus providing the theoretical underpinning for PBRL. Extensive experiments on D4RL benchmark show that PBRL has better performance compared to the state-of-the-art algorithms.
    FastRPB: a Scalable Relative Positional Encoding for Long Sequence Tasks. (arXiv:2202.11364v1 [cs.LG])
    Transformers achieve remarkable performance in various domains, including NLP, CV, audio processing, and graph analysis. However, they do not scale well on long sequence tasks due to their quadratic complexity w.r.t. the inputs length. Linear Transformers were proposed to address this limitation. However, these models have shown weaker performance on the long sequence tasks comparing to the original one. In this paper, we explore Linear Transformer models, rethinking their two core components. Firstly, we improved Linear Transformer with Shift-Invariant Kernel Function SIKF, which achieve higher accuracy without loss in speed. Secondly, we introduce FastRPB which stands for Fast Relative Positional Bias, which efficiently adds positional information to self-attention using Fast Fourier Transformation. FastRPB is independent of the self-attention mechanism and can be combined with an original self-attention and all its efficient variants. FastRPB has O(N log(N)) computational complexity, requiring O(N) memory w.r.t. input sequence length N.
    Reinforcement Learning from Demonstrations by Novel Interactive Expert and Application to Automatic Berthing Control Systems for Unmanned Surface Vessel. (arXiv:2202.11325v1 [eess.SY])
    In this paper, two novel practical methods of Reinforcement Learning from Demonstration (RLfD) are developed and applied to automatic berthing control systems for Unmanned Surface Vessel. A new expert data generation method, called Model Predictive Based Expert (MPBE) which combines Model Predictive Control and Deep Deterministic Policy Gradient, is developed to provide high quality supervision data for RLfD algorithms. A straightforward RLfD method, model predictive Deep Deterministic Policy Gradient (MP-DDPG), is firstly introduced by replacing the RL agent with MPBE to directly interact with the environment. Then distribution mismatch problem is analyzed for MP-DDPG, and two techniques that alleviate distribution mismatch are proposed. Furthermore, another novel RLfD algorithm based on the MP-DDPG, called Self-Guided Actor-Critic (SGAC) is present, which can effectively leverage MPBE by continuously querying it to generate high quality expert data online. The distribution mismatch problem leading to unstable learning process is addressed by SGAC in a DAgger manner. In addition, theoretical analysis is given to prove that SGAC algorithm can converge with guaranteed monotonic improvement. Simulation results verify the effectiveness of MP-DDPG and SGAC to accomplish the ship berthing control task, and show advantages of SGAC comparing with other typical reinforcement learning algorithms and MP-DDPG.
  • Open

    Exponential Tail Local Rademacher Complexity Risk Bounds Without the Bernstein Condition. (arXiv:2202.11461v1 [math.ST])
    The local Rademacher complexity framework is one of the most successful general-purpose toolboxes for establishing sharp excess risk bounds for statistical estimators based on the framework of empirical risk minimization. Applying this toolbox typically requires using the Bernstein condition, which often restricts applicability to convex and proper settings. Recent years have witnessed several examples of problems where optimal statistical performance is only achievable via non-convex and improper estimators originating from aggregation theory, including the fundamental problem of model selection. These examples are currently outside of the reach of the classical localization theory. In this work, we build upon the recent approach to localization via offset Rademacher complexities, for which a general high-probability theory has yet to be established. Our main result is an exponential-tail excess risk bound expressed in terms of the offset Rademacher complexity that yields results at least as sharp as those obtainable via the classical theory. However, our bound applies under an estimator-dependent geometric condition (the "offset condition") instead of the estimator-independent (but, in general, distribution-dependent) Bernstein condition on which the classical theory relies. Our results apply to improper prediction regimes not directly covered by the classical theory.  ( 2 min )
    Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut. (arXiv:2202.11539v1 [cs.CV])
    Transformers trained with self-supervised learning using self-distillation loss (DINO) have been shown to produce attention maps that highlight salient foreground objects. In this paper, we demonstrate a graph-based approach that uses the self-supervised transformer features to discover an object from an image. Visual tokens are viewed as nodes in a weighted graph with edges representing a connectivity score based on the similarity of tokens. Foreground objects can then be segmented using a normalized graph-cut to group self-similar regions. We solve the graph-cut problem using spectral clustering with generalized eigen-decomposition and show that the second smallest eigenvector provides a cutting solution since its absolute value indicates the likelihood that a token belongs to a foreground object. Despite its simplicity, this approach significantly boosts the performance of unsupervised object discovery: we improve over the recent state of the art LOST by a margin of 6.9%, 8.1%, and 8.1% respectively on the VOC07, VOC12, and COCO20K. The performance can be further improved by adding a second stage class-agnostic detector (CAD). Our proposed method can be easily extended to unsupervised saliency detection and weakly supervised object detection. For unsupervised saliency detection, we improve IoU for 4.9%, 5.2%, 12.9% on ECSSD, DUTS, DUT-OMRON respectively compared to previous state of the art. For weakly supervised object detection, we achieve competitive performance on CUB and ImageNet.  ( 2 min )
    A Class of Geometric Structures in Transfer Learning: Minimax Bounds and Optimality. (arXiv:2202.11685v1 [cs.LG])
    We study the problem of transfer learning, observing that previous efforts to understand its information-theoretic limits do not fully exploit the geometric structure of the source and target domains. In contrast, our study first illustrates the benefits of incorporating a natural geometric structure within a linear regression model, which corresponds to the generalized eigenvalue problem formed by the Gram matrices of both domains. We next establish a finite-sample minimax lower bound, propose a refined model interpolation estimator that enjoys a matching upper bound, and then extend our framework to multiple source domains and generalized linear models. Surprisingly, as long as information is available on the distance between the source and target parameters, negative-transfer does not occur. Simulation studies show that our proposed interpolation estimator outperforms state-of-the-art transfer learning methods in both moderate- and high-dimensional settings.
    BacHMMachine: An Interpretable and Scalable Model for Algorithmic Harmonization for Four-part Baroque Chorales. (arXiv:2109.07623v2 [cs.SD] UPDATED)
    Algorithmic harmonization - the automated harmonization of a musical piece given its melodic line - is a challenging problem that has garnered much interest from both music theorists and computer scientists. One genre of particular interest is the four-part Baroque chorales of J.S. Bach. Methods for algorithmic chorale harmonization typically adopt a black-box, "data-driven" approach: they do not explicitly integrate principles from music theory but rely on a complex learning model trained with a large amount of chorale data. We propose instead a new harmonization model, called BacHMMachine, which employs a "theory-driven" framework guided by music composition principles, along with a "data-driven" model for learning compositional features within this framework. As its name suggests, BacHMMachine uses a novel Hidden Markov Model based on key and chord transitions, providing a probabilistic framework for learning key modulations and chordal progressions from a given melodic line. This allows for the generation of creative, yet musically coherent chorale harmonizations; integrating compositional principles allows for a much simpler model that results in vast decreases in computational burden and greater interpretability compared to state-of-the-art algorithmic harmonization methods, at no penalty to quality of harmonization or musicality. We demonstrate this improvement via comprehensive experiments and Turing tests comparing BacHMMachine to existing methods.
    Preformer: Predictive Transformer with Multi-Scale Segment-wise Correlations for Long-Term Time Series Forecasting. (arXiv:2202.11356v1 [cs.LG])
    Transformer-based methods have shown great potential in long-term time series forecasting. However, most of these methods adopt the standard point-wise self-attention mechanism, which not only becomes intractable for long-term forecasting since its complexity increases quadratically with the length of time series, but also cannot explicitly capture the predictive dependencies from contexts since the corresponding key and value are transformed from the same point. This paper proposes a predictive Transformer-based model called {\em Preformer}. Preformer introduces a novel efficient {\em Multi-Scale Segment-Correlation} mechanism that divides time series into segments and utilizes segment-wise correlation-based attention for encoding time series. A multi-scale structure is developed to aggregate dependencies at different temporal scales and facilitate the selection of segment length. Preformer further designs a predictive paradigm for decoding, where the key and value come from two successive segments rather than the same segment. In this way, if a key segment has a high correlation score with the query segment, its successive segment contributes more to the prediction of the query segment. Extensive experiments demonstrate that our Preformer outperforms other Transformer-based methods.
    Learning Quantile Functions without Quantile Crossing for Distribution-free Time Series Forecasting. (arXiv:2111.06581v2 [cs.LG] UPDATED)
    Quantile regression is an effective technique to quantify uncertainty, fit challenging underlying distributions, and often provide full probabilistic predictions through joint learnings over multiple quantile levels. A common drawback of these joint quantile regressions, however, is \textit{quantile crossing}, which violates the desirable monotone property of the conditional quantile function. In this work, we propose the Incremental (Spline) Quantile Functions I(S)QF, a flexible and efficient distribution-free quantile estimation framework that resolves quantile crossing with a simple neural network layer. Moreover, I(S)QF inter/extrapolate to predict arbitrary quantile levels that differ from the underlying training ones. Equipped with the analytical evaluation of the continuous ranked probability score of I(S)QF representations, we apply our methods to NN-based times series forecasting cases, where the savings of the expensive re-training costs for non-trained quantile levels is particularly significant. We also provide a generalization error analysis of our proposed approaches under the sequence-to-sequence setting. Lastly, extensive experiments demonstrate the improvement of consistency and accuracy errors over other baselines.
    FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning. (arXiv:2111.11556v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FLIX, that takes into account the unique challenges brought by federated learning. FLIX has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FLIX does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FLIX formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation.
    TD-GEN: Graph Generation With Tree Decomposition. (arXiv:2106.10656v2 [cs.LG] UPDATED)
    We propose TD-GEN, a graph generation framework based on tree decomposition, and introduce a reduced upper bound on the maximum number of decisions needed for graph generation. The framework includes a permutation invariant tree generation model which forms the backbone of graph generation. Tree nodes are supernodes, each representing a cluster of nodes in the graph. Graph nodes and edges are incrementally generated inside the clusters by traversing the tree supernodes, respecting the structure of the tree decomposition, and following node sharing decisions between the clusters. Finally, we discuss the shortcomings of standard evaluation criteria based on statistical properties of the generated graphs as performance measures. We propose to compare the performance of models based on likelihood. Empirical results on a variety of standard graph generation datasets demonstrate the superior performance of our method.
    A general sample complexity analysis of vanilla policy gradient. (arXiv:2107.11433v3 [cs.LG] UPDATED)
    We adapt recent tools developed for the analysis of Stochastic Gradient Descent (SGD) in non-convex optimization to obtain convergence and sample complexity guarantees for the vanilla policy gradient (PG). Our only assumptions are that the expected return is smooth w.r.t. the policy parameters, that its $H$-step truncated gradient is close to the exact gradient, and a certain ABC assumption. This assumption requires the second moment of the estimated gradient to be bounded by $A \geq 0$ times the suboptimality gap, $B \geq 0$ times the norm of the full batch gradient and an additive constant $C \geq 0$, or any combination of aforementioned. We show that the ABC assumption is more general than the commonly used assumptions on the policy space to prove convergence to a stationary point. We provide a single convergence theorem that recovers the $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity of PG. Our results also affords greater flexibility in the choice of hyper parameters such as the step size and places no restriction on the batch size $m$, including the single trajectory case (i.e., $m=1$). We then instantiate our theorem in different settings, where we both recover existing results and obtained improved sample complexity, e.g., for convergence to the global optimum for Fisher-non-degenerated parameterized policies.
    Laplacian Constrained Precision Matrix Estimation: Existence and High Dimensional Consistency. (arXiv:2111.00590v2 [stat.ML] UPDATED)
    This paper considers the problem of estimating high dimensional Laplacian constrained precision matrices by minimizing Stein's loss. We obtain a necessary and sufficient condition for existence of this estimator, that consists on checking whether a certain data dependent graph is connected. We also prove consistency in the high dimensional setting under the symmetrized Stein loss. We show that the error rate does not depend on the graph sparsity, or other type of structure, and that Laplacian constraints are sufficient for high dimensional consistency. Our proofs exploit properties of graph Laplacians, the matrix tree theorem, and a characterization of the proposed estimator based on effective graph resistances. We validate our theoretical claims with numerical experiments.
    Multivariate Quantile Function Forecaster. (arXiv:2202.11316v1 [cs.LG])
    We propose Multivariate Quantile Function Forecaster (MQF$^2$), a global probabilistic forecasting method constructed using a multivariate quantile function and investigate its application to multi-horizon forecasting. Prior approaches are either autoregressive, implicitly capturing the dependency structure across time but exhibiting error accumulation with increasing forecast horizons, or multi-horizon sequence-to-sequence models, which do not exhibit error accumulation, but also do typically not model the dependency structure across time steps. MQF$^2$ combines the benefits of both approaches, by directly making predictions in the form of a multivariate quantile function, defined as the gradient of a convex function which we parametrize using input-convex neural networks. By design, the quantile function is monotone with respect to the input quantile levels and hence avoids quantile crossing. We provide two options to train MQF$^2$: with energy score or with maximum likelihood. Experimental results on real-world and synthetic datasets show that our model has comparable performance with state-of-the-art methods in terms of single time step metrics while capturing the time dependency structure.
    Fast Sparse Classification for Generalized Linear and Additive Models. (arXiv:2202.11389v1 [cs.LG])
    We present fast classification techniques for sparse generalized linear and additive models. These techniques can handle thousands of features and thousands of observations in minutes, even in the presence of many highly correlated features. For fast sparse logistic regression, our computational speed-up over other best-subset search techniques owes to linear and quadratic surrogate cuts for the logistic loss that allow us to efficiently screen features for elimination, as well as use of a priority queue that favors a more uniform exploration of features. As an alternative to the logistic loss, we propose the exponential loss, which permits an analytical solution to the line search at each iteration. Our algorithms are generally 2 to 5 times faster than previous approaches. They produce interpretable models that have accuracy comparable to black box models on challenging datasets.
    Neural Generalised AutoRegressive Conditional Heteroskedasticity. (arXiv:2202.11285v1 [cs.LG])
    We propose Neural GARCH, a class of methods to model conditional heteroskedasticity in financial time series. Neural GARCH is a neural network adaptation of the GARCH 1,1 model in the univariate case, and the diagonal BEKK 1,1 model in the multivariate case. We allow the coefficients of a GARCH model to be time varying in order to reflect the constantly changing dynamics of financial markets. The time varying coefficients are parameterised by a recurrent neural network that is trained with stochastic gradient variational Bayes. We propose two variants of our model, one with normal innovations and the other with Students t innovations. We test our models on a wide range of univariate and multivariate financial time series, and we find that the Neural Students t model consistently outperforms the others.
    Efficient CDF Approximations for Normalizing Flows. (arXiv:2202.11322v1 [cs.LG])
    Normalizing flows model a complex target distribution in terms of a bijective transform operating on a simple base distribution. As such, they enable tractable computation of a number of important statistical quantities, particularly likelihoods and samples. Despite these appealing properties, the computation of more complex inference tasks, such as the cumulative distribution function (CDF) over a complex region (e.g., a polytope) remains challenging. Traditional CDF approximations using Monte-Carlo techniques are unbiased but have unbounded variance and low sample efficiency. Instead, we build upon the diffeomorphic properties of normalizing flows and leverage the divergence theorem to estimate the CDF over a closed region in target space in terms of the flux across its \emph{boundary}, as induced by the normalizing flow. We describe both deterministic and stochastic instances of this estimator: while the deterministic variant iteratively improves the estimate by strategically subdividing the boundary, the stochastic variant provides unbiased estimates. Our experiments on popular flow architectures and UCI benchmark datasets show a marked improvement in sample efficiency as compared to traditional estimators.
    FastSHAP: Real-Time Shapley Value Estimation. (arXiv:2107.07436v2 [stat.ML] UPDATED)
    Shapley values are widely used to explain black-box models, but they are costly to calculate because they require many model evaluations. We introduce FastSHAP, a method for estimating Shapley values in a single forward pass using a learned explainer model. FastSHAP amortizes the cost of explaining many inputs via a learning approach inspired by the Shapley value's weighted least squares characterization, and it can be trained using standard stochastic gradient optimization. We compare FastSHAP to existing estimation approaches, revealing that it generates high-quality explanations with orders of magnitude speedup.
    Differentially Private Regression with Unbounded Covariates. (arXiv:2202.11199v1 [cs.CR])
    We provide computationally efficient, differentially private algorithms for the classical regression settings of Least Squares Fitting, Binary Regression and Linear Regression with unbounded covariates. Prior to our work, privacy constraints in such regression settings were studied under strong a priori bounds on covariates. We consider the case of Gaussian marginals and extend recent differentially private techniques on mean and covariance estimation (Kamath et al., 2019; Karwa and Vadhan, 2018) to the sub-gaussian regime. We provide a novel technical analysis yielding differentially private algorithms for the above classical regression settings. Through the case of Binary Regression, we capture the fundamental and widely-studied models of logistic regression and linearly-separable SVMs, learning an unbiased estimate of the true regression vector, up to a scaling factor.
    Exploiting Side Information for Improved Online Learning Algorithms in Wireless Networks. (arXiv:2202.11699v1 [cs.IT])
    In wireless networks, the rate achieved depends on factors like level of interference, hardware impairments, and channel gain. Often, instantaneous values of some of these factors can be measured, and they provide useful information about the instantaneous rate achieved. For example, higher interference implies a lower rate. In this work, we treat any such measurable quality that has a non-zero correlation with the rate achieved as side-information and study how it can be exploited to quickly learn the channel that offers higher throughput (reward). When the mean value of the side-information is known, using control variate theory we develop algorithms that require fewer samples to learn the parameters and can improve the learning rate compared to cases where side-information is ignored. Specifically, we incorporate side-information in the classical Upper Confidence Bound (UCB) algorithm and quantify the gain achieved in the regret performance. We show that the gain is proportional to the amount of the correlation between the reward and associated side-information. We discuss in detail various side-information that can be exploited in cognitive radio and air-to-ground communication in $L-$band. We demonstrate that correlation between the reward and side-information is often strong in practice and exploiting it improves the throughput significantly.
    Nonconvex Extension of Generalized Huber Loss for Robust Learning and Pseudo-Mode Statistics. (arXiv:2202.11141v1 [stat.ML])
    We propose an extended generalization of the pseudo Huber loss formulation. We show that using the log-exp transform together with the logistic function, we can create a loss which combines the desirable properties of the strictly convex losses with robust loss functions. With this formulation, we show that a linear convergence algorithm can be utilized to find a minimizer. We further discuss the creation of a quasi-convex composite loss and provide a derivative-free exponential convergence rate algorithm.
    Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms. (arXiv:2202.11277v1 [cs.IT])
    We consider the problem of quantizing a linear model learned from measurements $\mathbf{X} = \mathbf{W}\boldsymbol{\theta} + \mathbf{v}$. The model is constrained to be representable using only $dB$-bits, where $B \in (0, \infty)$ is a pre-specified budget and $d$ is the dimension of the model. We derive an information-theoretic lower bound for the minimax risk under this setting and show that it is tight with a matching upper bound. This upper bound is achieved using randomized embedding based algorithms. We propose randomized Hadamard embeddings that are computationally efficient while performing near-optimally. We also show that our method and upper-bounds can be extended for two-layer ReLU neural networks. Numerical simulations validate our theoretical claims.
    Contextual Combinatorial Conservative Bandits. (arXiv:1911.11337v2 [cs.LG] UPDATED)
    The problem of multi-armed bandits (MAB) asks to make sequential decisions while balancing between exploitation and exploration, and have been successfully applied to a wide range of practical scenarios. Various algorithms have been designed to achieve a high reward in a long term. However, its short-term performance might be rather low, which is injurious in risk sensitive applications. Building on previous work of conservative bandits, we bring up a framework of contextual combinatorial conservative bandits. An algorithm is presented and a regret bound of $\tilde O(d^2+d\sqrt{T})$ is proven, where $d$ is the dimension of the feature vectors, and $T$ is the total number of time steps. We further provide an algorithm as well as regret analysis for the case when the conservative reward is unknown. Experiments are conducted, and the results validate the effectiveness of our algorithm.
    Many processors, little time: MCMC for partitions via optimal transport couplings. (arXiv:2202.11258v1 [stat.ME])
    Markov chain Monte Carlo (MCMC) methods are often used in clustering since they guarantee asymptotically exact expectations in the infinite-time limit. In finite time, though, slow mixing often leads to poor performance. Modern computing environments offer massive parallelism, but naive implementations of parallel MCMC can exhibit substantial bias. In MCMC samplers of continuous random variables, Markov chain couplings can overcome bias. But these approaches depend crucially on paired chains meetings after a small number of transitions. We show that straightforward applications of existing coupling ideas to discrete clustering variables fail to meet quickly. This failure arises from the "label-switching problem": semantically equivalent cluster relabelings impede fast meeting of coupled chains. We instead consider chains as exploring the space of partitions rather than partitions' (arbitrary) labelings. Using a metric on the partition space, we formulate a practical algorithm using optimal transport couplings. Our theory confirms our method is accurate and efficient. In experiments ranging from clustering of genes or seeds to graph colorings, we show the benefits of our coupling in the highly parallel, time-limited regime.
    Differential privacy for symmetric log-concave mechanisms. (arXiv:2202.11393v1 [cs.CR])
    Adding random noise to database query results is an important tool for achieving privacy. A challenge is to minimize this noise while still meeting privacy requirements. Recently, a sufficient and necessary condition for $(\epsilon, \delta)$-differential privacy for Gaussian noise was published. This condition allows the computation of the minimum privacy-preserving scale for this distribution. We extend this work and provide a sufficient and necessary condition for $(\epsilon, \delta)$-differential privacy for all symmetric and log-concave noise densities. Our results allow fine-grained tailoring of the noise distribution to the dimensionality of the query result. We demonstrate that this can yield significantly lower mean squared errors than those incurred by the currently used Laplace and Gaussian mechanisms for the same $\epsilon$ and $\delta$.
    Towards Speaker Age Estimation with Label Distribution Learning. (arXiv:2202.11424v1 [cs.SD])
    Existing methods for speaker age estimation usually treat it as a multi-class classification or a regression problem. However, precise age identification remains a challenge due to label ambiguity, \emph{i.e.}, utterances from adjacent age of the same person are often indistinguishable. To address this, we utilize the ambiguous information among the age labels, convert each age label into a discrete label distribution and leverage the label distribution learning (LDL) method to fit the data. For each audio data sample, our method produces a age distribution of its speaker, and on top of the distribution we also perform two other tasks: age prediction and age uncertainty minimization. Therefore, our method naturally combines the age classification and regression approaches, which enhances the robustness of our method. We conduct experiments on the public NIST SRE08-10 dataset and a real-world dataset, which exhibit that our method outperforms baseline methods by a relatively large margin, yielding a 10\% reduction in terms of mean absolute error (MAE) on a real-world dataset.
    Sparse Orthogonal Variational Inference for Gaussian Processes. (arXiv:1910.10596v4 [stat.ML] UPDATED)
    We introduce a new interpretation of sparse variational approximations for Gaussian processes using inducing points, which can lead to more scalable algorithms than previous methods. It is based on decomposing a Gaussian process as a sum of two independent processes: one spanned by a finite basis of inducing points and the other capturing the remaining variation. We show that this formulation recovers existing approximations and at the same time allows to obtain tighter lower bounds on the marginal likelihood and new stochastic variational inference algorithms. We demonstrate the efficiency of these algorithms in several Gaussian process models ranging from standard regression to multi-class classification using (deep) convolutional Gaussian processes and report state-of-the-art results on CIFAR-10 among purely GP-based models.
    Meta-learning based Alternating Minimization Algorithm for Non-convex Optimization. (arXiv:2009.04899v6 [cs.LG] UPDATED)
    In this paper, we propose a novel solution for non-convex problems of multiple variables, especially for those typically solved by an alternating minimization (AM) strategy that splits the original optimization problem into a set of sub-problems corresponding to each variable, and then iteratively optimize each sub-problem using a fixed updating rule. However, due to the intrinsic non-convexity of the original optimization problem, the optimization can usually be trapped into spurious local minimum even when each sub-problem can be optimally solved at each iteration. Meanwhile, learning-based approaches, such as deep unfolding algorithms, are highly limited by the lack of labelled data and restricted explainability. To tackle these issues, we propose a meta-learning based alternating minimization (MLAM) method, which aims to minimize a partial of the global losses over iterations instead of carrying minimization on each sub-problem, and it tends to learn an adaptive strategy to replace the handcrafted counterpart resulting in advance on superior performance. Meanwhile, the proposed MLAM still maintains the original algorithmic principle, which contributes to a better interpretability. We evaluate the proposed method on two representative problems, namely, bi-linear inverse problem: matrix completion, and non-linear problem: Gaussian mixture models. The experimental results validate that our proposed approach outperforms AM-based methods in standard settings, and is able to achieve effective optimization in challenging cases while other comparing methods would typically fail.
    A Dimensionality Reduction Method for Finding Least Favorable Priors with a Focus on Bregman Divergence. (arXiv:2202.11598v1 [stat.ML])
    A common way of characterizing minimax estimators in point estimation is by moving the problem into the Bayesian estimation domain and finding a least favorable prior distribution. The Bayesian estimator induced by a least favorable prior, under mild conditions, is then known to be minimax. However, finding least favorable distributions can be challenging due to inherent optimization over the space of probability distributions, which is infinite-dimensional. This paper develops a dimensionality reduction method that allows us to move the optimization to a finite-dimensional setting with an explicit bound on the dimension. The benefit of this dimensionality reduction is that it permits the use of popular algorithms such as projected gradient ascent to find least favorable priors. Throughout the paper, in order to make progress on the problem, we restrict ourselves to Bayesian risks induced by a relatively large class of loss functions, namely Bregman divergences.
    Bayesian Model Selection, the Marginal Likelihood, and Generalization. (arXiv:2202.11678v1 [cs.LG])
    How do we compare between hypotheses that are entirely consistent with observations? The marginal likelihood (aka Bayesian evidence), which represents the probability of generating our observations from a prior, provides a distinctive approach to this foundational question, automatically encoding Occam's razor. Although it has been observed that the marginal likelihood can overfit and is sensitive to prior assumptions, its limitations for hyperparameter learning and discrete model comparison have not been thoroughly investigated. We first revisit the appealing properties of the marginal likelihood for learning constraints and hypothesis testing. We then highlight the conceptual and practical issues in using the marginal likelihood as a proxy for generalization. Namely, we show how marginal likelihood can be negatively correlated with generalization, with implications for neural architecture search, and can lead to both underfitting and overfitting in hyperparameter learning. We provide a partial remedy through a conditional marginal likelihood, which we show is more aligned with generalization, and practically valuable for large-scale hyperparameter learning, such as in deep kernel learning.
    How Many Data Are Needed for Robust Learning?. (arXiv:2202.11592v1 [cs.LG])
    We show that the sample complexity of robust interpolation problem could be exponential in the input dimensionality and discover a phase transition phenomenon when the data are in a unit ball. Robust interpolation refers to the problem of interpolating $n$ noisy training data in $\R^d$ by a Lipschitz function. Although this problem has been well understood when the covariates are drawn from an isoperimetry distribution, much remains unknown concerning its performance under generic or even the worst-case distributions. Our results are two-fold: 1) too many data hurt robustness; we provide a tight and universal Lipschitzness lower bound $\Omega(n^{1/d})$ of the interpolating function for arbitrary data distributions. Our result disproves potential existence of an $\mathcal{O}(1)$-Lipschitz function in the overparametrization scenario when $n=\exp(\omega(d))$. 2) Small data hurt robustness: $n=\exp(\Omega(d))$ is necessary for obtaining a good population error under certain distributions by any $\mathcal{O}(1)$-Lipschitz learning algorithm. Perhaps surprisingly, our results shed light on the curse of big data and the blessing of dimensionality for robustness, and discover an intriguing phenomenon of phase transition at $n=\exp(\Theta(d))$.
    Wide Mean-Field Bayesian Neural Networks Ignore the Data. (arXiv:2202.11670v1 [cs.LG])
    Bayesian neural networks (BNNs) combine the expressive power of deep learning with the advantages of Bayesian formalism. In recent years, the analysis of wide, deep BNNs has provided theoretical insight into their priors and posteriors. However, we have no analogous insight into their posteriors under approximate inference. In this work, we show that mean-field variational inference entirely fails to model the data when the network width is large and the activation function is odd. Specifically, for fully-connected BNNs with odd activation functions and a homoscedastic Gaussian likelihood, we show that the optimal mean-field variational posterior predictive (i.e., function space) distribution converges to the prior predictive distribution as the width tends to infinity. We generalize aspects of this result to other likelihoods. Our theoretical results are suggestive of underfitting behavior previously observered in BNNs. While our convergence bounds are non-asymptotic and constants in our analysis can be computed, they are currently too loose to be applicable in standard training regimes. Finally, we show that the optimal approximate posterior need not tend to the prior if the activation function is not odd, showing that our statements cannot be generalized arbitrarily.
    Analysis of a Target-Based Actor-Critic Algorithm with Linear Function Approximation. (arXiv:2106.07472v2 [cs.LG] UPDATED)
    Actor-critic methods integrating target networks have exhibited a stupendous empirical success in deep reinforcement learning. However, a theoretical understanding of the use of target networks in actor-critic methods is largely missing in the literature. In this paper, we reduce this gap between theory and practice by proposing the first theoretical analysis of an online target-based actor-critic algorithm with linear function approximation in the discounted reward setting. Our algorithm uses three different timescales: one for the actor and two for the critic. Instead of using the standard single timescale temporal difference (TD) learning algorithm as a critic, we use a two timescales target-based version of TD learning closely inspired from practical actor-critic algorithms implementing target networks. First, we establish asymptotic convergence results for both the critic and the actor under Markovian sampling. Then, we provide a finite-time analysis showing the impact of incorporating a target network into actor-critic methods.
    Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks : Function Approximation and Statistical Recovery. (arXiv:1908.01842v5 [cs.LG] UPDATED)
    Real world data often exhibit low-dimensional geometric structures, and can be viewed as samples near a low-dimensional manifold. This paper studies nonparametric regression of H\"{o}lder functions on low-dimensional manifolds using deep ReLU networks. Suppose $n$ training data are sampled from a H\"{o}lder function in $\mathcal{H}^{s,\alpha}$ supported on a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$, with sub-gaussian noise. A deep ReLU network architecture is designed to estimate the underlying function from the training data. The mean squared error of the empirical estimator is proved to converge in the order of $n^{-\frac{2(s+\alpha)}{2(s+\alpha) + d}}\log^3 n$. This result shows that deep ReLU networks give rise to a fast convergence rate depending on the data intrinsic dimension $d$, which is usually much smaller than the ambient dimension $D$. It therefore demonstrates the adaptivity of deep ReLU networks to low-dimensional geometric structures of data, and partially explains the power of deep ReLU networks in tackling high-dimensional data with low-dimensional geometric structures.
    Amortized Causal Discovery: Learning to Infer Causal Graphs from Time-Series Data. (arXiv:2006.10833v3 [cs.LG] UPDATED)
    On time-series data, most causal discovery methods fit a new model whenever they encounter samples from a new underlying causal graph. However, these samples often share relevant information which is lost when following this approach. Specifically, different samples may share the dynamics which describe the effects of their causal relations. We propose Amortized Causal Discovery, a novel framework that leverages such shared dynamics to learn to infer causal relations from time-series data. This enables us to train a single, amortized model that infers causal relations across samples with different underlying causal graphs, and thus leverages the shared dynamics information. We demonstrate experimentally that this approach, implemented as a variational model, leads to significant improvements in causal discovery performance, and show how it can be extended to perform well under added noise and hidden confounding.
    Amortised Likelihood-free Inference for Expensive Time-series Simulators with Signatured Ratio Estimation. (arXiv:2202.11585v1 [stat.ML])
    Simulation models of complex dynamics in the natural and social sciences commonly lack a tractable likelihood function, rendering traditional likelihood-based statistical inference impossible. Recent advances in machine learning have introduced novel algorithms for estimating otherwise intractable likelihood functions using a likelihood ratio trick based on binary classifiers. Consequently, efficient likelihood approximations can be obtained whenever good probabilistic classifiers can be constructed. We propose a kernel classifier for sequential data using path signatures based on the recently introduced signature kernel. We demonstrate that the representative power of signatures yields a highly performant classifier, even in the crucially important case where sample numbers are low. In such scenarios, our approach can outperform sophisticated neural networks for common posterior inference tasks.
    Traversing Time with Multi-Resolution Gaussian Process State-Space Models. (arXiv:2112.03230v2 [cs.LG] UPDATED)
    Gaussian Process state-space models capture complex temporal dependencies in a principled manner by placing a Gaussian Process prior on the transition function. These models have a natural interpretation as discretized stochastic differential equations, but inference for long sequences with fast and slow transitions is difficult. Fast transitions need tight discretizations whereas slow transitions require backpropagating the gradients over long subtrajectories. We propose a novel Gaussian process state-space architecture composed of multiple components, each trained on a different resolution, to model effects on different timescales. The combined model allows traversing time on adaptive scales, providing efficient inference for arbitrarily long sequences with complex dynamics. We benchmark our novel method on semi-synthetic data and on an engine modeling task. In both experiments, our approach compares favorably against its state-of-the-art alternatives that operate on a single time-scale only.
    Robust Geometric Metric Learning. (arXiv:2202.11550v1 [stat.ML])
    This paper proposes new algorithms for the metric learning problem. We start by noticing that several classical metric learning formulations from the literature can be viewed as modified covariance matrix estimation problems. Leveraging this point of view, a general approach, called Robust Geometric Metric Learning (RGML), is then studied. This method aims at simultaneously estimating the covariance matrix of each class while shrinking them towards their (unknown) barycenter. We focus on two specific costs functions: one associated with the Gaussian likelihood (RGML Gaussian), and one with Tyler's M -estimator (RGML Tyler). In both, the barycenter is defined with the Riemannian distance, which enjoys nice properties of geodesic convexity and affine invariance. The optimization is performed using the Riemannian geometry of symmetric positive definite matrices and its submanifold of unit determinant. Finally, the performance of RGML is asserted on real datasets. Strong performance is exhibited while being robust to mislabeled data.
    Can Pretext-Based Self-Supervised Learning Be Boosted by Downstream Data? A Theoretical Analysis. (arXiv:2103.03568v4 [cs.LG] UPDATED)
    Pretext-based self-supervised learning learns the semantic representation via a handcrafted pretext task over unlabeled data and then uses the learned representation for downstream tasks, which effectively reduces the sample complexity of downstream tasks under Conditional Independence (CI) condition. However, the downstream sample complexity gets much worse if the CI condition does not hold. One interesting question is whether we can make the CI condition hold by using downstream data to refine the unlabeled data to boost self-supervised learning. At first glance, one might think that seeing downstream data in advance would always boost the downstream performance. However, we show that it is not intuitively true and point out that in some cases, it hurts the final performance instead. In particular, we prove both model-free and model-dependent lower bounds of the number of downstream samples used for data refinement. Moreover, we conduct various experiments on both synthetic and real-world datasets to verify our theoretical results.
    Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction. (arXiv:2202.11559v1 [physics.comp-ph])
    Parameter reconstructions are indispensable in metrology. Here, on wants to explain $K$ experimental measurements by fitting to them a parameterized model of the measurement process. The model parameters are regularly determined by least-square methods, i.e., by minimizing the sum of the squared residuals between the $K$ model predictions and the $K$ experimental observations, $\chi^2$. The model functions often involve computationally demanding numerical simulations. Bayesian optimization methods are specifically suited for minimizing expensive model functions. However, in contrast to least-square methods such as the Levenberg-Marquardt algorithm, they only take the value of $\chi^2$ into account, and neglect the $K$ individual model outputs. We introduce a Bayesian target-vector optimization scheme that considers all $K$ contributions of the model function and that is specifically suited for parameter reconstruction problems which are often based on hundreds of observations. Its performance is compared to established methods for an optical metrology reconstruction problem and two synthetic least-squares problems. The proposed method outperforms established optimization methods. It also enables to determine accurate uncertainty estimates with very few observations of the actual model function by using Markov chain Monte Carlo sampling on a trained surrogate model.
    A Framework to Learn with Interpretation. (arXiv:2010.09345v4 [cs.LG] UPDATED)
    To tackle interpretability in deep learning, we present a novel framework to jointly learn a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability about the predictive model in terms of human-understandable high level attribute functions, with minimal loss of accuracy. This is achieved by a dedicated architecture and well chosen regularization penalties. We seek for a small-size dictionary of high level attribute functions that take as inputs the outputs of selected hidden layers and whose outputs feed a linear classifier. We impose strong conciseness on the activation of attributes with an entropy-based criterion while enforcing fidelity to both inputs and outputs of the predictive model. A detailed pipeline to visualize the learnt features is also developed. Moreover, besides generating interpretable models by design, our approach can be specialized to provide post-hoc interpretations for a pre-trained neural network. We validate our approach against several state-of-the-art methods on multiple datasets and show its efficacy on both kinds of tasks.
    Testing Granger Non-Causality in Panels with Cross-Sectional Dependencies. (arXiv:2202.11612v1 [stat.ME])
    This paper proposes a new approach for testing Granger non-causality on panel data. Instead of aggregating panel member statistics, we aggregate their corresponding p-values and show that the resulting p-value approximately bounds the type I error by the chosen significance level even if the panel members are dependent. We compare our approach against the most widely used Granger causality algorithm on panel data and show that our approach yields lower FDR at the same power for large sample sizes and panels with cross-sectional dependencies. Finally, we examine COVID-19 data about confirmed cases and deaths measured in countries/regions worldwide and show that our approach is able to discover the true causal relation between confirmed cases and deaths while state-of-the-art approaches fail.
    Curiosity-Driven Exploration via Latent Bayesian Surprise. (arXiv:2104.07495v2 [cs.LG] UPDATED)
    The human intrinsic desire to pursue knowledge, also known as curiosity, is considered essential in the process of skill acquisition. With the aid of artificial curiosity, we could equip current techniques for control, such as Reinforcement Learning, with more natural exploration capabilities. A promising approach in this respect has consisted of using Bayesian surprise on model parameters, i.e. a metric for the difference between prior and posterior beliefs, to favour exploration. In this contribution, we propose to apply Bayesian surprise in a latent space representing the agent's current understanding of the dynamics of the system, drastically reducing the computational costs. We extensively evaluate our method by measuring the agent's performance in terms of environment exploration, for continuous tasks, and looking at the game scores achieved, for video games. Our model is computationally cheap and compares positively with current state-of-the-art methods on several problems. We also investigate the effects caused by stochasticity in the environment, which is often a failure case for curiosity-driven agents. In this regime, the results suggest that our approach is resilient to stochastic transitions.
    Residual Bootstrap Exploration for Stochastic Linear Bandit. (arXiv:2202.11474v1 [stat.ML])
    We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next step reward by re-sampling the residuals of mean reward estimate. Our algorithm, residual bootstrap exploration for stochastic linear bandit (\texttt{LinReBoot}), estimates the linear reward from its re-sampling distribution and pulls the arm with the highest reward estimate. In particular, we contribute a theoretical framework to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems. The key insight is that the strength of bootstrap exploration is based on collaborated optimism between the online-learned model and the re-sampling distribution of residuals. Such observation enables us to show that the proposed \texttt{LinReBoot} secure a high-probability $\tilde{O}(d \sqrt{n})$ sub-linear regret under mild conditions. Our experiments support the easy generalizability of the \texttt{ReBoot} principle in the various formulations of linear bandit problems and show the significant computational efficiency of \texttt{LinReBoot}.
    Finding Safe Zones of policies Markov Decision Processes. (arXiv:2202.11593v1 [cs.LG])
    Given a policy, we define a SafeZone as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of the SafeZone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. SafeZones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal SafeZones, and show that in general the problem is computationally hard. For this reason we concentrate on computing approximate SafeZones. Our main result is a bi-criteria approximation algorithm which gives a factor of almost $2$ approximation for both the escape probability and SafeZone size, using a polynomial size sample complexity. We conclude the paper with an empirical evaluation of our algorithm.
    Parallel MCMC Without Embarrassing Failures. (arXiv:2202.11154v1 [stat.ML])
    Embarrassingly parallel Markov Chain Monte Carlo (MCMC) exploits parallel computing to scale Bayesian inference to large datasets by using a two-step approach. First, MCMC is run in parallel on (sub)posteriors defined on data partitions. Then, a server combines local results. While efficient, this framework is very sensitive to the quality of subposterior sampling. Common sampling problems such as missing modes or misrepresentation of low-density regions are amplified -- instead of being corrected -- in the combination phase, leading to catastrophic failures. In this work, we propose a novel combination strategy to mitigate this issue. Our strategy, Parallel Active Inference (PAI), leverages Gaussian Process (GP) surrogate modeling and active learning. After fitting GPs to subposteriors, PAI (i) shares information between GP surrogates to cover missing modes; and (ii) uses active sampling to individually refine subposterior approximations. We validate PAI in challenging benchmarks, including heavy-tailed and multi-modal posteriors and a real-world application to computational neuroscience. Empirical results show that PAI succeeds where previous methods catastrophically fail, with a small communication overhead.
    Mirror Descent Strikes Again: Optimal Stochastic Convex Optimization under Infinite Noise Variance. (arXiv:2202.11632v1 [stat.ML])
    We study stochastic convex optimization under infinite noise variance. Specifically, when the stochastic gradient is unbiased and has uniformly bounded $(1+\kappa)$-th moment, for some $\kappa \in (0,1]$, we quantify the convergence rate of the Stochastic Mirror Descent algorithm with a particular class of uniformly convex mirror maps, in terms of the number of iterations, dimensionality and related geometric parameters of the optimization problem. Interestingly this algorithm does not require any explicit gradient clipping or normalization, which have been extensively used in several recent empirical and theoretical works. We complement our convergence results with information-theoretic lower bounds showing that no other algorithm using only stochastic first-order oracles can achieve improved rates. Our results have several interesting consequences for devising online/streaming stochastic approximation algorithms for problems arising in robust statistics and machine learning.
    On PAC-Bayesian reconstruction guarantees for VAEs. (arXiv:2202.11455v1 [cs.LG])
    Despite its wide use and empirical successes, the theoretical understanding and study of the behaviour and performance of the variational autoencoder (VAE) have only emerged in the past few years. We contribute to this recent line of work by analysing the VAE's reconstruction ability for unseen test data, leveraging arguments from the PAC-Bayes theory. We provide generalisation bounds on the theoretical reconstruction error, and provide insights on the regularisation effect of VAE objectives. We illustrate our theoretical results with supporting experiments on classical benchmark datasets.
    NetRCA: An Effective Network Fault Cause Localization Algorithm. (arXiv:2202.11269v1 [cs.LG])
    Localizing the root cause of network faults is crucial to network operation and maintenance. However, due to the complicated network architectures and wireless environments, as well as limited labeled data, accurately localizing the true root cause is challenging. In this paper, we propose a novel algorithm named NetRCA to deal with this problem. Firstly, we extract effective derived features from the original raw data by considering temporal, directional, attribution, and interaction characteristics. Secondly, we adopt multivariate time series similarity and label propagation to generate new training data from both labeled and unlabeled data to overcome the lack of labeled samples. Thirdly, we design an ensemble model which combines XGBoost, rule set learning, attribution model, and graph algorithm, to fully utilize all data information and enhance performance. Finally, experiments and analysis are conducted on the real-world dataset from ICASSP 2022 AIOps Challenge to demonstrate the superiority and effectiveness of our approach.
    SSSNET: Semi-Supervised Signed Network Clustering. (arXiv:2110.06623v3 [cs.SI] UPDATED)
    Node embeddings are a powerful tool in the analysis of networks; yet, their full potential for the important task of node clustering has not been fully exploited. In particular, most state-of-the-art methods generating node embeddings of signed networks focus on link sign prediction, and those that pertain to node clustering are usually not graph neural network (GNN) methods. Here, we introduce a novel probabilistic balanced normalized cut loss for training nodes in a GNN framework for semi-supervised signed network clustering, called SSSNET. The method is end-to-end in combining embedding generation and clustering without an intermediate step; it has node clustering as main focus, with an emphasis on polarization effects arising in networks. The main novelty of our approach is a new take on the role of social balance theory for signed network embeddings. The standard heuristic for justifying the criteria for the embeddings hinges on the assumption that "an enemy's enemy is a friend". Here, instead, a neutral stance is assumed on whether or not the enemy of an enemy is a friend. Experimental results on various data sets, including a synthetic signed stochastic block model, a polarized version of it, and real-world data at different scales, demonstrate that SSSNET can achieve comparable or better results than state-of-the-art spectral clustering methods, for a wide range of noise and sparsity levels. SSSNET complements existing methods through the possibility of including exogenous information, in the form of node-level features or labels.
    Learning Fast and Slow for Online Time Series Forecasting. (arXiv:2202.11672v1 [cs.LG])
    The fast adaptation capability of deep neural networks in non-stationary environments is critical for online time series forecasting. Successful solutions require handling changes to new and recurring patterns. However, training deep neural forecaster on the fly is notoriously challenging because of their limited ability to adapt to non-stationary environments and the catastrophic forgetting of old knowledge. In this work, inspired by the Complementary Learning Systems (CLS) theory, we propose Fast and Slow learning Networks (FSNet), a holistic framework for online time-series forecasting to simultaneously deal with abrupt changing and repeating patterns. Particularly, FSNet improves the slowly-learned backbone by dynamically balancing fast adaptation to recent changes and retrieving similar old knowledge. FSNet achieves this mechanism via an interaction between two complementary components of an adapter to monitor each layer's contribution to the lost, and an associative memory to support remembering, updating, and recalling repeating events. Extensive experiments on real and synthetic datasets validate FSNet's efficacy and robustness to both new and recurring patterns. Our code will be made publicly available.
    No-Regret Learning with Unbounded Losses: The Case of Logarithmic Pooling. (arXiv:2202.11219v1 [cs.LG])
    For each of $T$ time steps, $m$ experts report probability distributions over $n$ outcomes; we wish to learn to aggregate these forecasts in a way that attains a no-regret guarantee. We focus on the fundamental and practical aggregation method known as logarithmic pooling -- a weighted average of log odds -- which is in a certain sense the optimal choice of pooling method if one is interested in minimizing log loss (as we take to be our loss function). We consider the problem of learning the best set of parameters (i.e. expert weights) in an online adversarial setting. We assume (by necessity) that the adversarial choices of outcomes and forecasts are consistent, in the sense that experts report calibrated forecasts. Our main result is an algorithm based on online mirror descent that learns expert weights in a way that attains $O(\sqrt{T} \log T)$ expected regret as compared with the best weights in hindsight.
    Reward-Weighted Regression Converges to a Global Optimum. (arXiv:2107.09088v3 [stat.ML] UPDATED)
    Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum.
    Globally Convergent Policy Search over Dynamic Filters for Output Estimation. (arXiv:2202.11659v1 [math.OC])
    We introduce the first direct policy search algorithm which provably converges to the globally optimal $\textit{dynamic}$ filter for the classical problem of predicting the outputs of a linear dynamical system, given noisy, partial observations. Despite the ubiquity of partial observability in practice, theoretical guarantees for direct policy search algorithms, one of the backbones of modern reinforcement learning, have proven difficult to achieve. This is primarily due to the degeneracies which arise when optimizing over filters that maintain internal state. In this paper, we provide a new perspective on this challenging problem based on the notion of $\textit{informativity}$, which intuitively requires that all components of a filter's internal state are representative of the true state of the underlying dynamical system. We show that informativity overcomes the aforementioned degeneracy. Specifically, we propose a $\textit{regularizer}$ which explicitly enforces informativity, and establish that gradient descent on this regularized objective - combined with a ``reconditioning step'' - converges to the globally optimal cost a $\mathcal{O}(1/T)$. Our analysis relies on several new results which may be of independent interest, including a new framework for analyzing non-convex gradient descent via convex reformulation, and novel bounds on the solution to linear Lyapunov equations in terms of (our quantitative measure of) informativity.
    A spectral-based analysis of the separation between two-layer neural networks and linear methods. (arXiv:2108.04964v2 [stat.ML] UPDATED)
    We propose a spectral-based approach to analyze how two-layer neural networks separate from linear methods in terms of approximating high-dimensional functions. We show that quantifying this separation can be reduced to estimating the Kolmogorov width of two-layer neural networks, and the latter can be further characterized by using the spectrum of an associated kernel. Different from previous work, our approach allows obtaining upper bounds, lower bounds, and identifying explicit hard functions in a united manner. We provide a systematic study of how the choice of activation functions affects the separation, in particular the dependence on the input dimension. Specifically, for nonsmooth activation functions, we extend known results to more activation functions with sharper bounds. As concrete examples, we prove that any single neuron can instantiate the separation between neural networks and random feature models. For smooth activation functions, one surprising finding is that the separation is negligible unless the norms of inner-layer weights are polynomially large with respect to the input dimension. By contrast, the separation for nonsmooth activation functions is independent of the norms of inner-layer weights.
    A new LDA formulation with covariates. (arXiv:2202.11527v1 [cs.IR])
    The Latent Dirichlet Allocation (LDA) model is a popular method for creating mixed-membership clusters. Despite having been originally developed for text analysis, LDA has been used for a wide range of other applications. We propose a new formulation for the LDA model which incorporates covariates. In this model, a negative binomial regression is embedded within LDA, enabling straight-forward interpretation of the regression coefficients and the analysis of the quantity of cluster-specific elements in each sampling units (instead of the analysis being focused on modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate model parameters. We rely on simulations to show how our algorithm is able to successfully retrieve the true parameter values and the ability to make predictions for the abundance matrix using the information given by the covariates. The model is illustrated using real data sets from three different areas: text-mining of Coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters.
    A Complete Criterion for Value of Information in Soluble Influence Diagrams. (arXiv:2202.11629v1 [cs.AI])
    Influence diagrams have recently been used to analyse the safety and fairness properties of AI systems. A key building block for this analysis is a graphical criterion for value of information (VoI). This paper establishes the first complete graphical criterion for VoI in influence diagrams with multiple decisions. Along the way, we establish two important techniques for proving properties of multi-decision influence diagrams: ID homomorphisms are structure-preserving transformations of influence diagrams, while a Tree of Systems is collection of paths that captures how information and control can flow in an influence diagram.
    Out-of-distribution Generalization via Partial Feature Decorrelation. (arXiv:2007.15241v4 [cs.LG] UPDATED)
    Most deep-learning-based image classification methods assume that all samples are generated under an independent and identically distributed (IID) setting. However, out-of-distribution (OOD) generalization is more common in practice, which means an agnostic context distribution shift between training and testing environments. To address this problem, we present a novel Partial Feature Decorrelation Learning (PFDL) algorithm, which jointly optimizes a feature decomposition network and the target image classification model. The feature decomposition network decomposes feature embeddings into the independent and the correlated parts such that the correlations between features will be highlighted. Then, the correlated features help learn a stable feature representation by decorrelating the highlighted correlations while optimizing the image classification model. We verify the correlation modeling ability of the feature decomposition network on a synthetic dataset. The experiments on real-world datasets demonstrate that our method can improve the backbone model's accuracy on OOD image classification datasets.

  • Open

    "Learning Causal Overhypotheses through Exploration in Children and Computational Models", Kosoy et al 2022
    submitted by /u/gwern [link] [comments]
    "QET: Selective Credit Assignment", Chelu et al 2022 {DM}
    submitted by /u/gwern [link] [comments]
    "VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning", Wang et al 2022 (supervised pretraining, then offline, then online)
    submitted by /u/gwern [link] [comments]
    RL Algorithms Implementation Study Group 2
    Hello everyone, hope all of you are having a great day. I am a Urela and I have a background in computer engineering. I was wondering if anyone would be interested in joining my RL Algorithms Implementation Study Group. I am interested in implementing some SOTA RL algorithms such as IMPALA, SAC, A3C, AlphaZero and I hope to find people with similar interests to work together. I wish to recreate the study group that Costa Huang did. To be transparent, I am final year undergrad student, I comfortable running the group but if someone more experienced is will to take over, I am happy with that. I am very comfortable with the maths and reading the papers, worst case i step down and someone takes over. Costa Huang's Group: https://www.youtube.com/channel/UCDdC6BIFRI0jvcwuhi3aI6w/videos https://www.reddit.com/r/reinforcementlearning/comments/dxebbp/rl_algorithms_implementation_study_group/ ​ Some notes: Costa Hung has no affiliations with this group.!!! If such a group exist, I am sorry to waste your time, please point me in the direction ​ You should probably be dedicating a solid 3 hours+ per week to gain something out of this. It's fine if you just want to join to see what is going on, but if you are willing to commit 3 hours+ per week with this group, please PM me suggest to I will create a discord link and if more than 3 people want to join submitted by /u/Urenmila [link] [comments]  ( 1 min )
    Optimal Hedge With Monte-Carlo
    For https://www.coursera.org/learn/reinforcement-learning-in-finance/lecture/mcCal/optimal-hedge-with-monte-carlo , how to use equation 53 to derive equations 54, 55 and 56 ? Besides, I do not understand how the theoretical formula comparison looks similar ? https://preview.redd.it/32fv9ax95pj81.png?width=961&format=png&auto=webp&s=2b6cf56072f977ea5c56f87fe202bf13f941ae45 https://preview.redd.it/we915j8b5pj81.png?width=959&format=png&auto=webp&s=91e8cf89bb2930e327cd28b52ab45c05fa77ce43 https://preview.redd.it/ph4uu7h7vrj81.png?width=959&format=png&auto=webp&s=58e3152812930e99e1b5e2fdc3af6a8783d0ec7b submitted by /u/promach [link] [comments]
    Deriving expression for optimal Q-function
    How to use equation 31 in https://www.coursera.org/learn/reinforcement-learning-in-finance/lecture/b8ZIT/mdp-formulation to derive equation 49 of https://www.coursera.org/learn/reinforcement-learning-in-finance/lecture/Nyz4f/backward-recursion-for-q-star ? https://preview.redd.it/qzfwhqkpsoj81.png?width=963&format=png&auto=webp&s=df35c34a9268636c5689e9f0a0855d6e43396a19 https://preview.redd.it/le4z7fpqsoj81.png?width=954&format=png&auto=webp&s=35fdff53e6da7ea954c4f06d98dd9798d85fd4d5 submitted by /u/promach [link] [comments]
  • Open

    [P] Hawai - An AI Model capable of friendly wellness support on everyday topics
    Hi We have a very friendly and supportive wellness bot called Hawai. You can share about everyday stuffs and get tips on feeling good and happy. Give it a shot! Spread the word, especially given that millions are affected and feel so isolated!!! You can help just by sharing in social media and among your friends!!! Thanks!!! Start a chat with Hawai - Health and wellbeing AI on Chai![ https://chai.ml/chat/share/\_bot\_d6f48541-d7a4-4f79-81fa-ec2c1b464b9e](https://chai.ml/chat/share/_bot_d6f48541-d7a4-4f79-81fa-ec2c1b464b9e) submitted by /u/SnooPears9870 [link] [comments]  ( 1 min )
    [D] Do you have a tool that let's business users run your models on a self-serve basis?
    I'm curious to hear the thoughts of the community. How many of your models are actually deployed in production/automated environments, and what proportion is used locally that you then send out as a report/csv/dashboard etc.? I see a lot of conversations here that talk about production pipelines and deployments but I work with teams that do tons of analysis, especially for smaller projects, that are done locally. DS teams end up working on business requests that involve downloading files, building some analysis and then sending results out as csvs or dashboards. Is that true for you as well? Are there tools you use to publish non production models so business users (that can't code) can use models on a self-serve basis? submitted by /u/dmart89 [link] [comments]  ( 1 min )
    [D] Treating temporal dimension as channel axis?
    For those of you who have worked with temporal data, have you ever trained a conv net where you treat the temporal axis as the channel axis for depth wise convolutions? In particular you would be extracting local correlations on the spatial information throughout the temporal sequence. ​ If you havent personally are you aware of any work that has done this? submitted by /u/AbjectDrink3276 [link] [comments]  ( 1 min )
    [Research] Looking for volunteers to evaluate multilingual dialogue models (chatbots)!
    Hi! Are you frustrated that most chatbots are only in English? Would you want your native language to be part of a conversational AI system as well? I certainly do! That’s why I’m doing my master’s thesis on multilingual open-domain dialogue systems. If you also feel the same and want to contribute to research on multilingual, intelligent chatbots, then you’re the perfect fit for my thesis project! I’m currently looking for people who can speak one of the following languages fluently: Arabic Bengali Finnish Japanese Korean The task is to post-edit some data in your language (after I used machine translation to translate it from English to the languages) or evaluate the models’ responses in your language. There are 30 short dialogue exchanges in the post editing task. Your task is to check the post machine translated text and correct the grammatical and/or fluency issues if needed. As for the system evaluation, you will chat with the chatbots and rate them on a scale of 5 about your view of their fluency, engagingness, naturalness, etc. I’m looking for 1 volunteer in each language for the post editing task and 2 volunteers in each language for the evaluation task. The tasks are pretty small so they won't take too much of your time. And it'd be fun to chat with the bots (I think) :) The post editing task preferably starts asap and the evaluation task will kick-start by the end of March! If you’re interested, please fill out this form :) I’ll of course credit you in the paper unless you would like to stay anonymous. This is a project I’m really passionate about, and I hope it’ll encourage more research on multilingual dialogue systems in the community! Please feel free to let me know if you’d like to know more! I really, really appreciate it. You can dm me or just leave a comment here. To sign up, please kindly fill out this google form. Thank you so much for reading this post patiently :) Hope you have a great day! submitted by /u/iAmagent606 [link] [comments]  ( 2 min )
    ML writing competition: an essay about ML work/research [P]
    https://thegradient.pub/gradient-prize/ submitted by /u/pahita [link] [comments]
    [D] Software engineers building apps with ML features
    Hey everyone - I know many of you here might be deep into ML but I'm curious about what you think when it comes to tools that make it easier for the typical software engineer to build smart solutions. My background is in computer vision so I'll use it as an example. APIs like Google Vision API and Amazon Rekognition exist, but they're very limited in their applications. What if there really was some higher-level way to build vision applications that's flexible, and works? Imagine something along the lines of having motion detectors, color detectors, object detectors, and other things that worked out of the box. Some could ML-based, other could be traditional computer vision based. Curious to hear your thoughts and whether you've seen anything out there, that's more focused on build solutions rather than focusing on ML as the main function. submitted by /u/happybirthday290 [link] [comments]  ( 2 min )
    [D] Deep Learning (summer) school
    Hey ! I'm looking to register to a (summer) school on ML I came across Deep Learn school. It seems to run 4 times a year which seems a bit sketchy at the first glance. Does anyone already gone to one of this or has any nice summer school propositions ? submitted by /u/eXpensia [link] [comments]  ( 1 min )
    [D] Which are some good applications of machine learning and deep learning on big geospatial data?
    I have a competition going on at my college to harness the power of ML and DL for big geospatial data analysis. I got a few ideas in my mind but I'm looking for some inspiration from others. From my side, I have been researching on using GAN to create super-resolution images from low res satellite images for our self-developed geographic information system (it's like google maps but on a low budget and local scale). So, any ideas or thoughts are welcome. Thanks for your time. submitted by /u/Chrex_007 [link] [comments]  ( 1 min )
    [D] ACL 2022 Acceptances
    When are the ACL 2022 decisions expected to be out? submitted by /u/East-Beginning9987 [link] [comments]  ( 1 min )
    [D] What's hot for Machine Learning Research in 2022?
    Which of the sub-fields/approaches, application areas are expected to gain much attention (pun unintended) this year in the academia? PS: Please don't shy away from suggesting anything that you think or know could be the trending research topic in ML, it is quite likely that what you know can be relatively unknown to many of us here :) submitted by /u/ureepamuree [link] [comments]  ( 3 min )
    Preprocessing unbounded data [D]
    Does anyone know (or have an article) for how to preprocess data without imposing a lower or upper bound on the data? I want to train a GAN that takes inputs of shape (None, 512, 256, 2) and outputs (None, 512, 256, 3). Normally, people use this formula to normalize their data with, however this method makes the assumption that the data has a lower and upper bounds. My data has no bounds, and their values could theoretically range from positive and negative infinity. To make matters worse, in each of the last axis in my data (axis (:, :, :, i)), my data has a different mean and standard deviation, which means each axis needs to be normalized independently. submitted by /u/suoarski [link] [comments]  ( 1 min )
    [R] Observational Studies: Commentaries on Breimen’s Two Cultures paper
    Celebrating the 20th anniversary of Leo Breimen's "Two Cultures" paper, the journal Observational Studies got together a wide variety of scholars to discuss it's legacy. Best part? It's all open access! Enjoy: https://muse.jhu.edu/issue/45147 submitted by /u/bikeskata [link] [comments]
    [R] A Modern Self-Referential Weight Matrix That Learns To Modify Itself
    submitted by /u/EducationalCicada [link] [comments]
  • Open

    Deep-learning technique predicts clinical treatment outcomes
    A new methodology simulates counterfactual, time-varying, and dynamic treatment strategies, allowing doctors to choose the best course of action.  ( 7 min )
    More sensitive X-ray imaging
    Improvements in the material that converts X-rays into light, for medical or industrial images, could allow a tenfold signal enhancement.  ( 6 min )
  • Open

    What AI can you recommend me for writing a text about a topic, and I can tell the AI that I want exactly the word "cat (etc.)" on this place and it should fit in the text?
    submitted by /u/xXLisa28Xx [link] [comments]  ( 1 min )
    Artificial Intelligence in the News
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Fun! Found a new web tool that uses generative ML to "paint with words" and edit images based on your typed input
    submitted by /u/roxanneonreddit [link] [comments]
    Cleerly’s AI scans comparable to gold-standard angiography in detecting heart disease: study
    submitted by /u/DaStock_Doctor [link] [comments]
    Designing an intelligent low-cost Webcam for home security
    Hi all, I want to design a system that can use low cost camera to detect intruders illegally entering the premises. I would like to make a CNN using video footage of home invasions to train the model. Does anybody on this community know where I'll be able to get this data/ video repository of home invasions. Any help will be appreciated. submitted by /u/DaveTheAutist [link] [comments]  ( 1 min )
    LinkedIn’s job-matching AI was biased. The company’s solution? More AI. ZipRecruiter, CareerBuilder, LinkedIn—most of the world’s biggest job search sites use AI to match people with job openings. But the algorithms don’t always play fair.
    submitted by /u/LisaMck041 [link] [comments]  ( 1 min )
    In a Latest Machine Learning Research, Amazon Researchers Propose an End-To-End Noise-Tolerant Embedding Learning Framework, ‘PGE’, to Jointly Leverage Both Text Information and Graph Structure in PG to Learn Embeddings for Error Detection
    With the rapid rise of the internet, e-commerce websites like Amazon, eBay, and Walmart have become essential avenues for online buying and business transactions. Product knowledge graphs (PGs) have garnered significant interest in recent years as an effective approach to organizing product-related information, enabling several real-world applications such as product search and recommendations. A knowledge graph (KG) that represents product attribute values is referred to as a PG. It’s made up of data from a product catalog. Each product in a PG has numerous attributes, including the product brand, product category, and additional information about product qualities like flavor and ingredients. Unlike traditional KGs, where most triples take the form (head entity, relation, tail entity), the majority of triples in a PG take the form (product, attribute, attribute value), with the attribute value being a short text, such as (“Brand A Tortilla Chips Spicy Queso, 6 – 2 oz bags”, flavor, “Spicy Queso”). Attribute triples are a type of triple. Individual shops contribute the vast majority of product catalog data. These self-reported statistics are always riddled with inconsistencies, erroneous values, and confusing values. When such errors are absorbed by a PG, they cause the downstream applications to perform poorly. Manual validation is not possible in a PG due to a large number of products. A solution for automatic validation is desperately needed. CONTINUE READING Paper: https://www.amazon.science/publications/pge-robust-product-graph-embedding-learning-for-error-detection https://preview.redd.it/zus1321a9pj81.png?width=1112&format=png&auto=webp&s=fd99754bd739831d4052e20211d45d63211ff4c5 submitted by /u/techsucker [link] [comments]  ( 1 min )
    Yann LeCun on a vision to make AI systems learn and reason like animals and humans
    submitted by /u/nick7566 [link] [comments]
    How long till we see a movie with a script written solely by artificial intelligence?
    submitted by /u/Advanced_Teaching_16 [link] [comments]  ( 1 min )
  • Open

    What distinguishes individual units?
    Forgive me as it’s probably a simple question, but what stops each unit ending up with the same gradient and bias in a layer? I don’t see anything inherently different about eg the first and second units in a layer, yes they have different constants at the end of testing but what property makes those constants different? Thanks submitted by /u/slashchunks [link] [comments]  ( 1 min )
    Almost no one knows how easily you can optimize your AI models
    The situation is fairly simple. Your model could run 10 times faster by adding a few lines to your code, but you weren't aware of it. Let me expand on that. AI applications are multiplying like mushrooms, which is awesome As a result, more and more people are turning to the dark side, joining the AI world, as I did The problem? Developers focus only on AI, cleaning up datasets and training their models. Almost no one has a background in hardware, compilers, computing, cloud, etc The result? Developers spend a lot of hours improving the accuracy and performance of their software, and all their hard work risks being undone by the wrong choice of hardware-software coupling This problem bothered me for a long time, so with a couple of buddies at Nebuly (all ex MIT, ETH and EPFL), we put a lot of energy into an open-source library called nebullvm to make DL compiler technology accessible to any developer, even for those who know nothing about hardware, as I did. How does it work? It speeds up your DL models by ~5-20x by testing the best DL compilers out there and selecting the optimal one to best couple your AI model with your machine (GPU, CPU, etc.). All this in just a few lines of code. The library is open source and you can find it here https://github.com/nebuly-ai/nebullvm. Please leave a star on GitHub for the hard work in building the library :) It's a simple act for you, a big smile for us. Thank you, and don't hesitate to contribute to the library! And let me know in the comments if you have questions! And thank you all!! submitted by /u/emilec___ [link] [comments]  ( 1 min )
  • Open

    Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits. (arXiv:2106.14866v2 [stat.ML] UPDATED)
    We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance from observing the learning process of a low-regret demonstrator. Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy, and thereby suffer from an identifiability issue. In contrast, we propose to leverage the demonstrator's behavior en route to optimality, and in particular, the exploration phase, for reward estimation. We begin by establishing a general information-theoretic lower bound under this paradigm that applies to any demonstrator algorithm, which characterizes a fundamental tradeoff between reward estimation and the amount of exploration of the demonstrator. Then, we develop simple and efficient reward estimators for upper-confidence-based demonstrator algorithms that attain the optimal tradeoff, showing in particular that consistent reward estimation -- free of identifiability issues -- is possible under our paradigm. Extensive simulations on both synthetic and semi-synthetic data corroborate our theoretical results.  ( 2 min )
    n-CPS: Generalising Cross Pseudo Supervision to n Networks for Semi-Supervised Semantic Segmentation. (arXiv:2112.07528v3 [cs.CV] UPDATED)
    We present n-CPS - a generalisation of the recent state-of-the-art cross pseudo supervision (CPS) approach for the task of semi-supervised semantic segmentation. In n-CPS, there are n simultaneously trained subnetworks that learn from each other through one-hot encoding perturbation and consistency regularisation. We also show that ensembling techniques applied to subnetworks outputs can significantly improve the performance. To the best of our knowledge, n-CPS paired with CutMix outperforms CPS and sets the new state-of-the-art for Pascal VOC 2012 with (1/16, 1/8, 1/4, and 1/2 supervised regimes) and Cityscapes (1/16 supervised).  ( 2 min )
    Automated question generation and question answering from Turkish texts using text-to-text transformers. (arXiv:2111.06476v3 [cs.LG] UPDATED)
    While exam-style questions are a fundamental educational tool serving a variety of purposes, manual construction of questions is a complex process that requires training, experience and resources. Automatic question generation (QG) techniques can be utilized to satisfy the need for a continuous supply of new questions by streamlining their generation. However, compared to automatic question answering (QA), QG is a more challenging task. In this work, we fine-tune a multilingual T5 (mT5) transformer in a multi-task setting for QA, QG and answer extraction tasks using Turkish QA datasets. To the best of our knowledge, this is the first academic work that performs automated text-to-text question generation from Turkish texts. Experimental evaluations show that the proposed multi-task setting achieves state-of-the-art Turkish question answering and question generation performance on TQuADv1, TQuADv2 datasets and XQuAD Turkish split. The source code and the pre-trained models are available at https://github.com/obss/turkish-question-generation.  ( 2 min )
    Mixture-of-Variational-Experts for Continual Learning. (arXiv:2110.12667v3 [cs.LG] UPDATED)
    One weakness of machine learning algorithms is the poor ability of models to solve new problems without forgetting previously acquired knowledge. The Continual Learning (CL) paradigm has emerged as a protocol to systematically investigate settings where the model sequentially observes samples generated by a series of tasks. In this work, we take a task-agnostic view of continual learning and develop a hierarchical information-theoretic optimality principle that facilitates a trade-off between learning and forgetting. We discuss this principle from a Bayesian perspective and show its connections to previous approaches to CL. Based on this principle, we propose a neural network layer, called the Mixture-of-Variational-Experts layer, that alleviates forgetting by creating a set of information processing paths through the network which is governed by a gating policy. Due to the general formulation based on generic utility functions, we can apply this optimality principle to a large variety of learning problems, including supervised learning, reinforcement learning, and generative modeling. We demonstrate the competitive performance of our method in continual supervised learning and in continual reinforcement learning.  ( 2 min )
    Wastewater Pipe Condition Rating Model Using K- Nearest Neighbors. (arXiv:2202.11049v1 [cs.LG])
    Risk-based assessment in pipe condition mainly focuses on prioritizing the most critical assets by evaluating the risk of pipe failure. This paper's goal is to classify a comprehensive pipe rating model which is obtained based on a series of pipe physical, external, and hydraulic characteristics that are identified for the proposed methodology. The traditional manual method of assessing sewage structural conditions takes a long time. By building an automated process using K-Nearest Neighbors (K-NN), this study presents an effective technique to automate the identification of the pipe defect rating using the pipe repair data. First, we performed the Shapiro Wilks Test for 1240 data from the Dept. of Engineering & Environmental Services, Shreveport, Louisiana Phase 3 with 12 variables to determine if factors could be incorporated in the final rating. We then developed a K-Nearest Neighbors model to classify the final rating from the statistically significant factors identified in Shapiro Wilks Test. This classification process allows recognizing the worst condition of wastewater pipes that need to be replaced immediately. This comprehensive model is built according to the industry-accepted and used guidelines to estimate the overall condition. Finally, for validation purposes, the proposed model is applied to a small portion of a US wastewater collection system in Shreveport, Louisiana. Keywords: Pipe rating, Shapiro Wilks Test, K-Nearest Neighbors (KNN), Failure, Risk analysis  ( 2 min )
    Improving Classification Model Performance on Chest X-Rays through Lung Segmentation. (arXiv:2202.10971v1 [eess.IV])
    Chest radiography is an effective screening tool for diagnosing pulmonary diseases. In computer-aided diagnosis, extracting the relevant region of interest, i.e., isolating the lung region of each radiography image, can be an essential step towards improved performance in diagnosing pulmonary disorders. Methods: In this work, we propose a deep learning approach to enhance abnormal chest x-ray (CXR) identification performance through segmentations. Our approach is designed in a cascaded manner and incorporates two modules: a deep neural network with criss-cross attention modules (XLSor) for localizing lung region in CXR images and a CXR classification model with a backbone of a self-supervised momentum contrast (MoCo) model pre-trained on large-scale CXR data sets. The proposed pipeline is evaluated on Shenzhen Hospital (SH) data set for the segmentation module, and COVIDx data set for both segmentation and classification modules. Novel statistical analysis is conducted in addition to regular evaluation metrics for the segmentation module. Furthermore, the results of the optimized approach are analyzed with gradient-weighted class activation mapping (Grad-CAM) to investigate the rationale behind the classification decisions and to interpret its choices. Results and Conclusion: Different data sets, methods, and scenarios for each module of the proposed pipeline are examined for designing an optimized approach, which has achieved an accuracy of 0.946 in distinguishing abnormal CXR images (i.e., Pneumonia and COVID-19) from normal ones. Numerical and visual validations suggest that applying automated segmentation as a pre-processing step for classification improves the generalization capability and the performance of the classification models.  ( 2 min )
    DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression. (arXiv:2202.10584v1 [cs.LG])
    Data reduction in storage systems is becoming increasingly important as an effective solution to minimize the management cost of a data center. To maximize data-reduction efficiency, existing post-deduplication delta-compression techniques perform delta compression along with traditional data deduplication and lossless compression. Unfortunately, we observe that existing techniques achieve significantly lower data-reduction ratios than the optimal due to their limited accuracy in identifying similar data blocks. In this paper, we propose DeepSketch, a new reference search technique for post-deduplication delta compression that leverages the learning-to-hash method to achieve higher accuracy in reference search for delta compression, thereby improving data-reduction efficiency. DeepSketch uses a deep neural network to extract a data block's sketch, i.e., to create an approximate data signature of the block that can preserve similarity with other blocks. Our evaluation using eleven real-world workloads shows that DeepSketch improves the data-reduction ratio by up to 33% (21% on average) over a state-of-the-art post-deduplication delta-compression technique.  ( 2 min )
    Policy Evaluation for Temporal and/or Spatial Dependent Experiments in Ride-sourcing Platforms. (arXiv:2202.10887v1 [stat.ME])
    Policy evaluation based on A/B testing has attracted considerable interest in digital marketing, but such evaluation in ride-sourcing platforms (e.g., Uber and Didi) is not well studied primarily due to the complex structure of their temporal and/or spatial dependent experiments. Motivated by policy evaluation in ride-sourcing platforms, the aim of this paper is to establish causal relationship between platform's policies and outcomes of interest under a switchback design. We propose a novel potential outcome framework based on a temporal varying coefficient decision process (VCDP) model to capture the dynamic treatment effects in temporal dependent experiments. We further characterize the average treatment effect by decomposing it as the sum of direct effect (DE) and indirect effect (IE). We develop estimation and inference procedures for both DE and IE. Furthermore, we propose a spatio-temporal VCDP to deal with spatiotemporal dependent experiments. For both VCDP models, we establish the statistical properties (e.g., weak convergence and asymptotic power) of our estimation and inference procedures. We conduct extensive simulations to investigate the finite-sample performance of the proposed estimation and inference procedures. We examine how our VCDP models can help improve policy evaluation for various dispatching and dispositioning policies in Didi.  ( 2 min )
    Causal inference for observational longitudinal studies using deep survival models. (arXiv:2101.10643v11 [stat.ML] UPDATED)
    Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history. To tackle this longitudinal treatment effect estimation problem, we have developed a causal dynamic survival (CDS) model that uses the potential outcomes framework with an ensemble of recurrent subnetworks to estimate the difference in survival probabilities and its confidence interval over time as a function of past, time-dependent covariates and treatments. Using simulated survival datasets, the CDS model showed good causal effect estimation performance across scenarios of varying sample dimensions, event rates, confounding and overlapping. However, increasing the sample size was not effective in alleviating the adverse impact of a high level of confounding. In two large clinical cohort studies, CDS identified the expected conditional average treatment effect and detected individual treatment effect heterogeneity over time. CDS provides an efficient way to estimate and update individualized treatment effects over time, in order to improve clinical decisions.  ( 2 min )
    Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales. (arXiv:2106.06418v2 [cs.CV] UPDATED)
    The ability to handle large scale variations is crucial for many real world visual tasks. A straightforward approach for handling scale in a deep network is to process an image at several scales simultaneously in a set of scale channels. Scale invariance can then, in principle, be achieved by using weight sharing between the scale channels together with max or average pooling over the outputs from the scale channels. The ability of such scale channel networks to generalise to scales not present in the training set over significant scale ranges has, however, not previously been explored. In this paper, we present a systematic study of this methodology by implementing different types of scale channel networks and evaluating their ability to generalise to previously unseen scales. We develop a formalism for analysing the covariance and invariance properties of scale channel networks, and explore how different design choices, unique to scaling transformations, affect the overall performance of scale channel networks. We first show that two previously proposed scale channel network designs do not generalise well to scales not present in the training set. We explain theoretically and demonstrate experimentally why generalisation fails in these cases. We then propose a new type of foveated scale channel architecture}, where the scale channels process increasingly larger parts of the image with decreasing resolution. This new type of scale channel network is shown to generalise extremely well, provided sufficient image resolution and the absence of boundary effects. Our proposed FovMax and FovAvg networks perform almost identically over a scale range of 8, also when training on single scale training data, and do also give improved performance when learning from datasets with large scale variations in the small sample regime.  ( 3 min )
    A Cram\'er Distance perspective on Quantile Regression based Distributional Reinforcement Learning. (arXiv:2110.00535v2 [stat.ML] UPDATED)
    Distributional reinforcement learning (DRL) extends the value-based approach by approximating the full distribution over future returns instead of the mean only, providing a richer signal that leads to improved performances. Quantile Regression (QR) based methods like QR-DQN project arbitrary distributions into a parametric subset of staircase distributions by minimizing the 1-Wasserstein distance. However, due to biases in the gradients, the quantile regression loss is used instead for training, guaranteeing the same minimizer and enjoying unbiased gradients. Non-crossing constraints on the quantiles have been shown to improve the performance of QR-DQN for uncertainty-based exploration strategies. The contribution of this work is in the setting of fixed quantile levels and is twofold. First, we prove that the Cram\'er distance yields a projection that coincides with the 1-Wasserstein one and that, under non-crossing constraints, the squared Cram\'er and the quantile regression losses yield collinear gradients, shedding light on the connection between these important elements of DRL. Second, we propose a low complexity algorithm to compute the Cram\'er distance.  ( 2 min )
    Subtyping brain diseases from imaging data. (arXiv:2202.10945v1 [cs.LG])
    The imaging community has increasingly adopted machine learning (ML) methods to provide individualized imaging signatures related to disease diagnosis, prognosis, and response to treatment. Clinical neuroscience and cancer imaging have been two areas in which ML has offered particular promise. However, many neurologic and neuropsychiatric diseases, as well as cancer, are often heterogeneous in terms of their clinical manifestations, neuroanatomical patterns or genetic underpinnings. Therefore, in such cases, seeking a single disease signature might be ineffectual in delivering individualized precision diagnostics. The current chapter focuses on ML methods, especially semi-supervised clustering, that seek disease subtypes using imaging data. Work from Alzheimer Disease and its prodromal stages, psychosis, depression, autism, and brain cancer are discussed. Our goal is to provide the readers with a broad overview in terms of methodology and clinical applications.  ( 2 min )
    Stochastic Extragradient: General Analysis and Improved Rates. (arXiv:2111.08611v3 [math.OC] UPDATED)
    The Stochastic Extragradient (SEG) method is one of the most popular algorithms for solving min-max optimization and variational inequalities problems (VIP) appearing in various machine learning tasks. However, several important questions regarding the convergence properties of SEG are still open, including the sampling of stochastic gradients, mini-batching, convergence guarantees for the monotone finite-sum variational inequalities with possibly non-monotone terms, and others. To address these questions, in this paper, we develop a novel theoretical framework that allows us to analyze several variants of SEG in a unified manner. Besides standard setups, like Same-Sample SEG under Lipschitzness and monotonicity or Independent-Samples SEG under uniformly bounded variance, our approach allows us to analyze variants of SEG that were never explicitly considered in the literature before. Notably, we analyze SEG with arbitrary sampling which includes importance sampling and various mini-batching strategies as special cases. Our rates for the new variants of SEG outperform the current state-of-the-art convergence guarantees and rely on less restrictive assumptions.  ( 2 min )
    Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks. (arXiv:2106.02978v2 [stat.ML] UPDATED)
    Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack with sub-linear regret. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by experiments that our proposed algorithm improves the robustness against various kinds of popular attacks.  ( 2 min )
    Towards Fine-grained Visual Representations by Combining Contrastive Learning with Image Reconstruction and Attention-weighted Pooling. (arXiv:2104.04323v2 [cs.CV] UPDATED)
    This paper presents Contrastive Reconstruction, ConRec - a self-supervised learning algorithm that obtains image representations by jointly optimizing a contrastive and a self-reconstruction loss. We showcase that state-of-the-art contrastive learning methods (e.g. SimCLR) have shortcomings to capture fine-grained visual features in their representations. ConRec extends the SimCLR framework by adding (1) a self-reconstruction task and (2) an attention mechanism within the contrastive learning task. This is accomplished by applying a simple encoder-decoder architecture with two heads. We show that both extensions contribute towards an improved vector representation for images with fine-grained visual features. Combining those concepts, ConRec outperforms SimCLR and SimCLR with Attention-Pooling on fine-grained classification datasets.  ( 2 min )
    Practical Schemes for Finding Near-Stationary Points of Convex Finite-Sums. (arXiv:2105.12062v2 [math.OC] UPDATED)
    In convex optimization, the problem of finding near-stationary points has not been adequately studied yet, unlike other optimality measures such as the function value. Even in the deterministic case, the optimal method (OGM-G, due to Kim and Fessler (2021)) has just been discovered recently. In this work, we conduct a systematic study of algorithmic techniques for finding near-stationary points of convex finite-sums. Our main contributions are several algorithmic discoveries: (1) we discover a memory-saving variant of OGM-G based on the performance estimation problem approach (Drori and Teboulle, 2014); (2) we design a new accelerated SVRG variant that can simultaneously achieve fast rates for minimizing both the gradient norm and function value; (3) we propose an adaptively regularized accelerated SVRG variant, which does not require the knowledge of some unknown initial constants and achieves near-optimal complexities. We put an emphasis on the simplicity and practicality of the new schemes, which could facilitate future work.  ( 2 min )
    Distilled Neural Networks for Efficient Learning to Rank. (arXiv:2202.10728v1 [cs.LG])
    Recent studies in Learning to Rank have shown the possibility to effectively distill a neural network from an ensemble of regression trees. This result leads neural networks to become a natural competitor of tree-based ensembles on the ranking task. Nevertheless, ensembles of regression trees outperform neural models both in terms of efficiency and effectiveness, particularly when scoring on CPU. In this paper, we propose an approach for speeding up neural scoring time by applying a combination of Distillation, Pruning and Fast Matrix multiplication. We employ knowledge distillation to learn shallow neural networks from an ensemble of regression trees. Then, we exploit an efficiency-oriented pruning technique that performs a sparsification of the most computationally-intensive layers of the neural network that is then scored with optimized sparse matrix multiplication. Moreover, by studying both dense and sparse high performance matrix multiplication, we develop a scoring time prediction model which helps in devising neural network architectures that match the desired efficiency requirements. Comprehensive experiments on two public learning-to-rank datasets show that neural networks produced with our novel approach are competitive at any point of the effectiveness-efficiency trade-off when compared with tree-based ensembles, providing up to 4x scoring time speed-up without affecting the ranking quality.  ( 2 min )
    RAILS: A Robust Adversarial Immune-inspired Learning System. (arXiv:2107.02840v2 [cs.NE] UPDATED)
    Adversarial attacks against deep neural networks (DNNs) are continuously evolving, requiring increasingly powerful defense strategies. We develop a novel adversarial defense framework inspired by the adaptive immune system: the Robust Adversarial Immune-inspired Learning System (RAILS). Initializing a population of exemplars that is balanced across classes, RAILS starts from a uniform label distribution that encourages diversity and uses an evolutionary optimization process to adaptively adjust the predictive label distribution in a manner that emulates the way the natural immune system recognizes novel pathogens. RAILS' evolutionary optimization process explicitly captures the tradeoff between robustness (diversity) and accuracy (specificity) of the network, and represents a new immune-inspired perspective on adversarial learning. The benefits of RAILS are empirically demonstrated under eight types of adversarial attacks on a DNN adversarial image classifier for several benchmark datasets, including: MNIST; SVHN; CIFAR-10; and CIFAR-10. We find that PGD is the most damaging attack strategy and that for this attack RAILS is significantly more robust than other methods, achieving improvements in adversarial robustness by $\geq 5.62\%, 12.5\%$, $10.32\%$, and $8.39\%$, on these respective datasets, without appreciable loss of classification accuracy. Codes for the results in this paper are available at https://github.com/wangren09/RAILS.  ( 2 min )
    Sobolev Transport: A Scalable Metric for Probability Measures with Graph Metrics. (arXiv:2202.10723v1 [cs.LG])
    Optimal transport (OT) is a popular measure to compare probability distributions. However, OT suffers a few drawbacks such as (i) a high complexity for computation, (ii) indefiniteness which limits its applicability to kernel machines. In this work, we consider probability measures supported on a graph metric space and propose a novel Sobolev transport metric. We show that the Sobolev transport metric yields a closed-form formula for fast computation and it is negative definite. We show that the space of probability measures endowed with this transport distance is isometric to a bounded convex set in a Euclidean space with a weighted $\ell_p$ distance. We further exploit the negative definiteness of the Sobolev transport to design positive-definite kernels, and evaluate their performances against other baselines in document classification with word embeddings and in topological data analysis.  ( 2 min )
    EIGNN: Efficient Infinite-Depth Graph Neural Networks. (arXiv:2202.10720v1 [cs.LG])
    Graph neural networks (GNNs) are widely used for modelling graph-structured data in numerous applications. However, with their inherently finite aggregation layers, existing GNN models may not be able to effectively capture long-range dependencies in the underlying graphs. Motivated by this limitation, we propose a GNN model with infinite depth, which we call Efficient Infinite-Depth Graph Neural Networks (EIGNN), to efficiently capture very long-range dependencies. We theoretically derive a closed-form solution of EIGNN which makes training an infinite-depth GNN model tractable. We then further show that we can achieve more efficient computation for training EIGNN by using eigendecomposition. The empirical results of comprehensive experiments on synthetic and real-world datasets show that EIGNN has a better ability to capture long-range dependencies than recent baselines, and consistently achieves state-of-the-art performance. Furthermore, we show that our model is also more robust against both noise and adversarial perturbations on node features.  ( 2 min )
    LOTR: Face Landmark Localization Using Localization Transformer. (arXiv:2109.10057v3 [cs.CV] UPDATED)
    This paper presents a novel Transformer-based facial landmark localization network named Localization Transformer (LOTR). The proposed framework is a direct coordinate regression approach leveraging a Transformer network to better utilize the spatial information in the feature map. An LOTR model consists of three main modules: 1) a visual backbone that converts an input image into a feature map, 2) a Transformer module that improves the feature representation from the visual backbone, and 3) a landmark prediction head that directly predicts the landmark coordinates from the Transformer's representation. Given cropped-and-aligned face images, the proposed LOTR can be trained end-to-end without requiring any post-processing steps. This paper also introduces the smooth-Wing loss function, which addresses the gradient discontinuity of the Wing loss, leading to better convergence than standard loss functions such as L1, L2, and Wing loss. Experimental results on the JD landmark dataset provided by the First Grand Challenge of 106-Point Facial Landmark Localization indicate the superiority of LOTR over the existing methods on the leaderboard and two recent heatmap-based approaches. On the WFLW dataset, the proposed LOTR framework demonstrates promising results compared with several state-of-the-art methods. Additionally, we report the improvement in state-of-the-art face recognition performance when using our proposed LOTRs for face alignment.  ( 2 min )
    Minimax Regret for Partial Monitoring: Infinite Outcomes and Rustichini's Regret. (arXiv:2202.10997v1 [math.OC])
    We show that a version of the generalised information ratio of Lattimore and Gyorgy (2020) determines the asymptotic minimax regret for all finite-action partial monitoring games provided that (a) the standard definition of regret is used but the latent space where the adversary plays is potentially infinite; or (b) the regret introduced by Rustichini (1999) is used and the latent space is finite. Our results are complemented by a number of examples. For any $p \in [1/2,1]$ there exists an infinite partial monitoring game for which the minimax regret over $n$ rounds is $n^p$ up to subpolynomial factors and there exist finite games for which the minimax Rustichini regret is $n^{4/7}$ up to subpolynomial factors.  ( 2 min )
    Graph Neural Network for Traffic Forecasting: A Survey. (arXiv:2101.11174v4 [cs.LG] UPDATED)
    Traffic forecasting is important for the success of intelligent transportation systems. Deep learning models, including convolution neural networks and recurrent neural networks, have been extensively applied in traffic forecasting problems to model spatial and temporal dependencies. In recent years, to model the graph structures in transportation systems as well as contextual information, graph neural networks have been introduced and have achieved state-of-the-art performance in a series of traffic forecasting problems. In this survey, we review the rapidly growing body of research using different graph neural networks, e.g. graph convolutional and graph attention networks, in various traffic forecasting problems, e.g. road traffic flow and speed forecasting, passenger flow forecasting in urban rail transit systems, and demand forecasting in ride-hailing platforms. We also present a comprehensive list of open data and source resources for each problem and identify future research directions. To the best of our knowledge, this paper is the first comprehensive survey that explores the application of graph neural networks for traffic forecasting problems. We have also created a public GitHub repository where the latest papers, open data, and source resources will be updated.  ( 2 min )
    Approximate gradient ascent methods for distortion risk measures. (arXiv:2202.11046v1 [cs.LG])
    We propose approximate gradient ascent algorithms for risk-sensitive reinforcement learning control problem in on-policy as well as off-policy settings. We consider episodic Markov decision processes, and model the risk using distortion risk measure (DRM) of the cumulative discounted reward. Our algorithms estimate the DRM using order statistics of the cumulative rewards, and calculate approximate gradients from the DRM estimates using a smoothed functional-based gradient estimation scheme. We derive non-asymptotic bounds that establish the convergence of our proposed algorithms to an approximate stationary point of the DRM objective.  ( 2 min )
    Adversarial Defense by Latent Style Transformations. (arXiv:2006.09701v2 [cs.CV] UPDATED)
    Machine learning models have demonstrated vulnerability to adversarial attacks, more specifically misclassification of adversarial examples. In this paper, we investigate an attack-agnostic defense against adversarial attacks on high-resolution images by detecting suspicious inputs. The intuition behind our approach is that the essential characteristics of a normal image are generally consistent with non-essential style transformations, e.g., slightly changing the facial expression of human portraits. In contrast, adversarial examples are generally sensitive to such transformations. In our approach to detect adversarial instances, we propose an in\underline{V}ertible \underline{A}utoencoder based on the \underline{S}tyleGAN2 generator via \underline{A}dversarial training (VASA) to inverse images to disentangled latent codes that reveal hierarchical styles. We then build a set of edited copies with non-essential style transformations by performing latent shifting and reconstruction, based on the correspondences between latent codes and style transformations. The classification-based consistency of these edited copies is used to distinguish adversarial instances.  ( 2 min )
    Pyxis: An Open-Source Performance Dataset of Sparse Accelerators. (arXiv:2110.04280v2 [cs.LG] UPDATED)
    Specialized accelerators provide gains of performance and efficiency in specific domains of applications. Sparse data structures or/and representations exist in a wide range of applications. However, it is challenging to design accelerators for sparse applications because no architecture or performance-level analytic models are able to fully capture the spectrum of the sparse data. Accelerator researchers rely on real execution to get precise feedback for their designs. In this work, we present PYXIS, a performance dataset for specialized accelerators on sparse data. PYXIS collects accelerator designs and real execution performance statistics. Currently, there are 73.8 K instances in PYXIS. PYXIS is open-source, and we are constantly growing PYXIS with new accelerator designs and performance statistics. PYXIS can benefit researchers in the fields of accelerator, architecture, performance, algorithm, and many related topics.
    Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning. (arXiv:2112.15110v2 [cs.SD] UPDATED)
    Could we automatically derive the score of a piano accompaniment based on the audio of a pop song? This is the audio-to-symbolic arrangement problem we tackle in this paper. A good arrangement model should not only consider the audio content but also have prior knowledge of piano composition (so that the generation "sounds like" the audio and meanwhile maintains musicality). To this end, we contribute a cross-modal representation-learning model, which 1) extracts chord and melodic information from the audio, and 2) learns texture representation from both audio and a corrupted ground truth arrangement. We further introduce a tailored training strategy that gradually shifts the source of texture information from corrupted score to audio. In the end, the score-based texture posterior is reduced to a standard normal distribution, and only audio is needed for inference. Experiments show that our model captures major audio information and outperforms baselines in generation quality.
    GRecX: An Efficient and Unified Benchmark for GNN-based Recommendation. (arXiv:2111.10342v3 [cs.IR] UPDATED)
    In this paper, we present GRecX, an open-source TensorFlow framework for benchmarking GNN-based recommendation models in an efficient and unified way. GRecX consists of core libraries for building GNN-based recommendation benchmarks, as well as the implementations of popular GNN-based recommendation models. The core libraries provide essential components for building efficient and unified benchmarks, including FastMetrics (efficient metrics computation libraries), VectorSearch (efficient similarity search libraries for dense vectors), BatchEval (efficient mini-batch evaluation libraries), and DataManager (unified dataset management libraries). Especially, to provide a unified benchmark for the fair comparison of different complex GNN-based recommendation models, we design a new metric GRMF-X and integrate it into the FastMetrics component. Based on a TensorFlow GNN library tf_geometric, GRecX carefully implements a variety of popular GNN-based recommendation models. We carefully implement these baseline models to reproduce the performance reported in the literature, and our implementations are usually more efficient and friendly. In conclusion, GRecX enables uses to train and benchmark GNN-based recommendation baselines in an efficient and unified way. We conduct experiments with GRecX, and the experimental results show that GRecX allows us to train and benchmark GNN-based recommendation baselines in an efficient and unified way. The source code of GRecX is available at https://github.com/maenzhier/GRecX.
    Gaussian Imagination in Bandit Learning. (arXiv:2201.01902v3 [cs.LG] UPDATED)
    Assuming distributions are Gaussian often facilitates computations that are otherwise intractable. We study the performance of an agent that attains a bounded information ratio with respect to a bandit environment with a Gaussian prior distribution and a Gaussian likelihood function when applied instead to a Bernoulli bandit. Relative to an information-theoretic bound on the Bayesian regret the agent would incur when interacting with the Gaussian bandit, we bound the increase in regret when the agent interacts with the Bernoulli bandit. If the Gaussian prior distribution and likelihood function are sufficiently diffuse, this increase grows at a rate which is at most linear in the square-root of the time horizon, and thus the per-timestep increase vanishes. Our results formalize the folklore that so-called Bayesian agents remain effective when instantiated with diffuse misspecified distributions.
    PFGE: Parsimonious Fast Geometric Ensembling of DNNs. (arXiv:2202.06658v2 [cs.LG] UPDATED)
    Ensemble methods have been widely used to improve the performance of machine learning methods in terms of generalization and uncertainty calibration, while they struggle to use in deep learning systems, as training an ensemble of deep neural networks (DNNs) and then deploying them for online prediction incur an extremely higher computational overhead of model training and test-time predictions. Recently, several advanced techniques, such as fast geometric ensembling (FGE) and snapshot ensemble, have been proposed. These methods can train the model ensembles in the same time as a single model, thus getting around the hurdle of training time. However, their overhead of model recording and test-time computations remains much higher than their single model based counterparts. Here we propose a parsimonious FGE (PFGE) that employs a lightweight ensemble of higher-performing DNNs generated by several successively-performed procedures of stochastic weight averaging. Experimental results across different advanced DNN architectures on different datasets, namely CIFAR-$\{$10,100$\}$ and Imagenet, demonstrate its performance. Results show that, compared with state-of-the-art methods, PFGE achieves better generalization performance and satisfactory calibration capability, while the overhead of model recording and test-time predictions is significantly reduced.
    Don't Touch What Matters: Task-Aware Lipschitz Data Augmentation for Visual Reinforcement Learning. (arXiv:2202.09982v2 [cs.CV] UPDATED)
    One of the key challenges in visual Reinforcement Learning (RL) is to learn policies that can generalize to unseen environments. Recently, data augmentation techniques aiming at enhancing data diversity have demonstrated proven performance in improving the generalization ability of learned policies. However, due to the sensitivity of RL training, naively applying data augmentation, which transforms each pixel in a task-agnostic manner, may suffer from instability and damage the sample efficiency, thus further exacerbating the generalization performance. At the heart of this phenomenon is the diverged action distribution and high-variance value estimation in the face of augmented images. To alleviate this issue, we propose Task-aware Lipschitz Data Augmentation (TLDA) for visual RL, which explicitly identifies the task-correlated pixels with large Lipschitz constants, and only augments the task-irrelevant pixels. To verify the effectiveness of TLDA, we conduct extensive experiments on DeepMind Control suite, CARLA and DeepMind Manipulation tasks, showing that TLDA improves both sample efficiency in training time and generalization in test time. It outperforms previous state-of-the-art methods across the 3 different visual control benchmarks.
    Pseudo Numerical Methods for Diffusion Models on Manifolds. (arXiv:2202.09778v1 [cs.CV] CROSS LISTED)
    Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at a high speedup rate, which limit their practicability. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules. Our implementation is available at https://github.com/luping-liu/PNDM.
    Hyper Attention Recurrent Neural Network: Tackling Temporal Covariate Shift in Time Series Analysis. (arXiv:2202.10808v1 [cs.LG])
    Analyzing long time series with RNNs often suffers from infeasible training. Segmentation is therefore commonly used in data pre-processing. However, in non-stationary time series, there exists often distribution shift among different segments. RNN is easily swamped in the dilemma of fitting bias in these segments due to the lack of global information, leading to poor generalization, known as Temporal Covariate Shift (TCS) problem, which is only addressed by a recently proposed RNN-based model. One of the assumptions in TCS is that the distribution of all divided intervals under the same segment are identical. This assumption, however, may not be true on high-frequency time series, such as traffic flow, that also have large stochasticity. Besides, macro information across long periods isn't adequately considered in the latest RNN-based methods. To address the above issues, we propose Hyper Attention Recurrent Neural Network (HARNN) for the modeling of temporal patterns containing both micro and macro information. An HARNN consists of a meta layer for parameter generation and an attention-enabled main layer for inference. High-frequency segments are transformed into low-frequency segments and fed into the meta layers, while the first main layer consumes the same high-frequency segments as conventional methods. In this way, each low-frequency segment in the meta inputs generates a unique main layer, enabling the integration of both macro information and micro information for inference. This forces all main layers to predict the same target which fully harnesses the common knowledge in varied distributions when capturing temporal patterns. Evaluations on multiple benchmarks demonstrated that our model outperforms a couple of RNN-based methods on a federation of key metrics.  ( 2 min )
    Differentially Private Federated Learning on Heterogeneous Data. (arXiv:2111.09278v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a paradigm for large-scale distributed learning which faces two key challenges: (i) efficient training from highly heterogeneous user data, and (ii) protecting the privacy of participating users. In this work, we propose a novel FL approach (DP-SCAFFOLD) to tackle these two challenges together by incorporating Differential Privacy (DP) constraints into the popular SCAFFOLD algorithm. We focus on the challenging setting where users communicate with a "honest-but-curious" server without any trusted intermediary, which requires to ensure privacy not only towards a third-party with access to the final model but also towards the server who observes all user communications. Using advanced results from DP theory, we establish the convergence of our algorithm for convex and non-convex objectives. Our analysis clearly highlights the privacy-utility trade-off under data heterogeneity, and demonstrates the superiority of DP-SCAFFOLD over the state-of-the-art algorithm DP-FedAvg when the number of local updates and the level of heterogeneity grow. Our numerical results confirm our analysis and show that DP-SCAFFOLD provides significant gains in practice.
    How I Learned to Stop Worrying and Love Retraining. (arXiv:2111.00843v2 [cs.LG] UPDATED)
    Many Neural Network Pruning approaches consist of several iterative training and pruning steps, seemingly losing a significant amount of their performance after pruning and then recovering it in the subsequent retraining phase. Recent works of Renda et al. (2020) and Le & Hua (2021) demonstrate the significance of the learning rate schedule during the retraining phase and propose specific heuristics for choosing such a schedule for IMP (Han et al., 2015). We place these findings in the context of the results of Li et al. (2020) regarding the training of models within a fixed training budget and demonstrate that, consequently, the retraining phase can be massively shortened using a simple linear learning rate schedule. Going a step further, we propose similarly imposing a budget on the initial dense training phase and show that the resulting simple and efficient method is capable of outperforming significantly more complex or heavily parameterized state-of-the-art approaches that attempt to sparsify the network during training. These findings not only advance our understanding of the retraining phase, but more broadly question the belief that one should aim to avoid the need for retraining and reduce the negative effects of 'hard' pruning by incorporating the sparsification process into the standard training.
    Generating Synthetic Mobility Networks with Generative Adversarial Networks. (arXiv:2202.11028v1 [cs.LG])
    The increasingly crucial role of human displacements in complex societal phenomena, such as traffic congestion, segregation, and the diffusion of epidemics, is attracting the interest of scientists from several disciplines. In this article, we address mobility network generation, i.e., generating a city's entire mobility network, a weighted directed graph in which nodes are geographic locations and weighted edges represent people's movements between those locations, thus describing the entire mobility set flows within a city. Our solution is MoGAN, a model based on Generative Adversarial Networks (GANs) to generate realistic mobility networks. We conduct extensive experiments on public datasets of bike and taxi rides to show that MoGAN outperforms the classical Gravity and Radiation models regarding the realism of the generated networks. Our model can be used for data augmentation and performing simulations and what-if analysis.  ( 2 min )
    Choquet-Based Fuzzy Rough Sets. (arXiv:2202.10872v1 [cs.LG])
    Fuzzy rough set theory can be used as a tool for dealing with inconsistent data when there is a gradual notion of indiscernibility between objects. It does this by providing lower and upper approximations of concepts. In classical fuzzy rough sets, the lower and upper approximations are determined using the minimum and maximum operators, respectively. This is undesirable for machine learning applications, since it makes these approximations sensitive to outlying samples. To mitigate this problem, ordered weighted average (OWA) based fuzzy rough sets were introduced. In this paper, we show how the OWA-based approach can be interpreted intuitively in terms of vague quantification, and then generalize it to Choquet-based fuzzy rough sets (CFRS). This generalization maintains desirable theoretical properties, such as duality and monotonicity. Furthermore, it provides more flexibility for machine learning applications. In particular, we show that it enables the seamless integration of outlier detection algorithms, to enhance the robustness of machine learning algorithms based on fuzzy rough sets.  ( 2 min )
    Wavebender GAN: An architecture for phonetically meaningful speech manipulation. (arXiv:2202.10973v1 [eess.AS])
    Deep learning has revolutionised synthetic speech quality. However, it has thus far delivered little value to the speech science community. The new methods do not meet the controllability demands that practitioners in this area require e.g.: in listening tests with manipulated speech stimuli. Instead, control of different speech properties in such stimuli is achieved by using legacy signal-processing methods. This limits the range, accuracy, and speech quality of the manipulations. Also, audible artefacts have a negative impact on the methodological validity of results in speech perception studies. This work introduces a system capable of manipulating speech properties through learning rather than design. The architecture learns to control arbitrary speech properties and leverages progress in neural vocoders to obtain realistic output. Experiments with copy synthesis and manipulation of a small set of core speech features (pitch, formants, and voice quality measures) illustrate the promise of the approach for producing speech stimuli that have accurate control and high perceptual quality.  ( 2 min )
    Enabling Reproducibility and Meta-learning Through a Lifelong Database of Experiments (LDE). (arXiv:2202.10979v1 [cs.LG])
    Artificial Intelligence (AI) development is inherently iterative and experimental. Over the course of normal development, especially with the advent of automated AI, hundreds or thousands of experiments are generated and are often lost or never examined again. There is a lost opportunity to document these experiments and learn from them at scale, but the complexity of tracking and reproducing these experiments is often prohibitive to data scientists. We present the Lifelong Database of Experiments (LDE) that automatically extracts and stores linked metadata from experiment artifacts and provides features to reproduce these artifacts and perform meta-learning across them. We store context from multiple stages of the AI development lifecycle including datasets, pipelines, how each is configured, and training runs with information about their runtime environment. The standardized nature of the stored metadata allows for querying and aggregation, especially in terms of ranking artifacts by performance metrics. We exhibit the capabilities of the LDE by reproducing an existing meta-learning study and storing the reproduced metadata in our system. Then, we perform two experiments on this metadata: 1) examining the reproducibility and variability of the performance metrics and 2) implementing a number of meta-learning algorithms on top of the data and examining how variability in experimental results impacts recommendation performance. The experimental results suggest significant variation in performance, especially depending on dataset configurations; this variation carries over when meta-learning is built on top of the results, with performance improving when using aggregated results. This suggests that a system that automatically collects and aggregates results such as the LDE not only assists in implementing meta-learning but may also improve its performance.  ( 2 min )
    Adaptive Cut Selection in Mixed-Integer Linear Programming. (arXiv:2202.10962v1 [math.OC])
    Cut selection is a subroutine used in all modern mixed-integer linear programming solvers with the goal of selecting a subset of generated cuts that induce optimal solver performance. These solvers have millions of parameter combinations, and so are excellent candidates for parameter tuning. Cut selection scoring rules are usually weighted sums of different measurements, where the weights are parameters. We present a parametric family of mixed-integer linear programs together with infinitely many family-wide valid cuts. Some of these cuts can induce integer optimal solutions directly after being applied, while others fail to do so even if an infinite amount are applied. We show for a specific cut selection rule, that any finite grid search of the parameter space will always miss all parameter values, which select integer optimal inducing cuts in an infinite amount of our problems. We propose a variation on the design of existing graph convolutional neural networks, adapting them to learn cut selection rule parameters. We present a reinforcement learning framework for selecting cuts, and train our design using said framework over MIPLIB 2017. Our framework and design show that adaptive cut selection does substantially improve performance over a diverse set of instances, but that finding a single function describing such a rule is difficult. Code for reproducing all experiments is available at https://github.com/Opt-Mucca/Adaptive-Cutsel-MILP.  ( 2 min )
    The Winning Solution to the iFLYTEK Challenge 2021 Cultivated Land Extraction from High-Resolution Remote Sensing Image. (arXiv:2202.10974v1 [cs.CV])
    Extracting cultivated land accurately from high-resolution remote images is a basic task for precision agriculture. This report introduces our solution to the iFLYTEK challenge 2021 cultivated land extraction from high-resolution remote sensing image. The challenge requires segmenting cultivated land objects in very high-resolution multispectral remote sensing images. We established a highly effective and efficient pipeline to solve this problem. We first divided the original images into small tiles and separately performed instance segmentation on each tile. We explored several instance segmentation algorithms that work well on natural images and developed a set of effective methods that are applicable to remote sensing images. Then we merged the prediction results of all small tiles into seamless, continuous segmentation results through our proposed overlap-tile fusion strategy. We achieved the first place among 486 teams in the challenge.  ( 2 min )
    How and When Random Feedback Works: A Case Study of Low-Rank Matrix Factorization. (arXiv:2111.08706v2 [cs.NE] UPDATED)
    The success of gradient descent in ML and especially for learning neural networks is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equi-weighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bio-plausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure -- low-rank matrix factorization. In this problem, given a matrix $Y_{n\times m}$, the goal is to find a low rank factorization $Z_{n \times r}W_{r \times m}$ that minimizes the error $\|ZW-Y\|_F$. Gradient descent solves this problem optimally. We show that FA converges to the optimal solution when $r\ge \mbox{rank}(Y)$. We also shed light on how FA works. It is observed empirically that the forward weight matrices and (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates convergence of FA*, a closely related variant of FA. We also show that FA can be far from optimal when $r < \mbox{rank}(Y)$. This is the first provable separation result between gradient descent and FA. Moreover, the representations found by gradient descent and FA can be almost orthogonal even when their error $\|ZW-Y\|_F$ is approximately equal. As a corollary, these results also hold for training two-layer linear neural networks when the training input is isotropic, and the output is a linear function of the input.
    Simple, Fast, and Flexible Framework for Matrix Completion with Infinite Width Neural Networks. (arXiv:2108.00131v2 [cs.LG] UPDATED)
    Matrix completion problems arise in many applications including recommendation systems, computer vision, and genomics. Increasingly larger neural networks have been successful in many of these applications, but at considerable computational costs. Remarkably, taking the width of a neural network to infinity allows for improved computational performance. In this work, we develop an infinite width neural network framework for matrix completion that is simple, fast, and flexible. Simplicity and speed come from the connection between the infinite width limit of neural networks and kernels known as neural tangent kernels (NTK). In particular, we derive the NTK for fully connected and convolutional neural networks for matrix completion. The flexibility stems from a feature prior, which allows encoding relationships between coordinates of the target matrix, akin to semi-supervised learning. The effectiveness of our framework is demonstrated through competitive results for virtual drug screening and image inpainting/reconstruction. We also provide an implementation in Python to make our framework accessible on standard hardware to a broad audience.
    Multiple Importance Sampling ELBO and Deep Ensembles of Variational Approximations. (arXiv:2202.10951v1 [cs.LG])
    In variational inference (VI), the marginal log-likelihood is estimated using the standard evidence lower bound (ELBO), or improved versions as the importance weighted ELBO (IWELBO). We propose the multiple importance sampling ELBO (MISELBO), a \textit{versatile} yet \textit{simple} framework. MISELBO is applicable in both amortized and classical VI, and it uses ensembles, e.g., deep ensembles, of independently inferred variational approximations. As far as we are aware, the concept of deep ensembles in amortized VI has not previously been established. We prove that MISELBO provides a tighter bound than the average of standard ELBOs, and demonstrate empirically that it gives tighter bounds than the average of IWELBOs. MISELBO is evaluated in density-estimation experiments that include MNIST and several real-data phylogenetic tree inference problems. First, on the MNIST dataset, MISELBO boosts the density-estimation performances of a state-of-the-art model, nouveau VAE. Second, in the phylogenetic tree inference setting, our framework enhances a state-of-the-art VI algorithm that uses normalizing flows. On top of the technical benefits of MISELBO, it allows to unveil connections between VI and recent advances in the importance sampling literature, paving the way for further methodological advances. We provide our code at \url{https://github.com/Lagergren-Lab/MISELBO}.  ( 2 min )
    Disentangling causal effects for hierarchical reinforcement learning. (arXiv:2010.01351v2 [cs.AI] UPDATED)
    Exploration and credit assignment under sparse rewards are still challenging problems. We argue that these challenges arise in part due to the intrinsic rigidity of operating at the level of actions. Actions can precisely define how to perform an activity but are ill-suited to describe what activity to perform. Instead, causal effects are inherently composable and temporally abstract, making them ideal for descriptive tasks. By leveraging a hierarchy of causal effects, this study aims to expedite the learning of task-specific behavior and aid exploration. Borrowing counterfactual and normality measures from causal literature, we disentangle controllable effects from effects caused by other dynamics of the environment. We propose CEHRL, a hierarchical method that models the distribution of controllable effects using a Variational Autoencoder. This distribution is used by a high-level policy to 1) explore the environment via random effect exploration so that novel effects are continuously discovered and learned, and to 2) learn task-specific behavior by prioritizing the effects that maximize a given reward function. In comparison to exploring with random actions, experimental results show that random effect exploration is a more efficient mechanism and that by assigning credit to few effects rather than many actions, CEHRL learns tasks more rapidly.
    SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions. (arXiv:2202.10451v1 [cs.LG])
    Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML), by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques, generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work we propose an AutoML technique SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy realized as a three-stage program synthesis approach, that reasons on successively smaller search spaces. The first stage uses a machine-learned model to predict a set of plausible ML components to constitute a pipeline. In the second stage, this is then refined into a small pool of viable concrete pipelines using syntactic constraints derived from the corpus and the machine-learned model. Dynamically evaluating these few pipelines, in the third stage, provides the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and uses the learned models to then synthesize pipelines for new predictive tasks. We have created a training corpus of 1094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, and against 3 state-of-the-art AutoML tools and 2 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances.
    VU-BERT: A Unified framework for Visual Dialog. (arXiv:2202.10787v1 [cs.CL])
    The visual dialog task attempts to train an agent to answer multi-turn questions given an image, which requires the deep understanding of interactions between the image and dialog history. Existing researches tend to employ the modality-specific modules to model the interactions, which might be troublesome to use. To fill in this gap, we propose a unified framework for image-text joint embedding, named VU-BERT, and apply patch projection to obtain vision embedding firstly in visual dialog tasks to simplify the model. The model is trained over two tasks: masked language modeling and next utterance retrieval. These tasks help in learning visual concepts, utterances dependence, and the relationships between these two modalities. Finally, our VU-BERT achieves competitive performance (0.7287 NDCG scores) on VisDial v1.0 Datasets.
    Submodlib: A Submodular Optimization Library. (arXiv:2202.10680v1 [cs.LG])
    Submodular functions are a special class of set functions which naturally model the notion of representativeness, diversity, coverage etc. and have been shown to be computationally very efficient. A lot of past work has applied submodular optimization to find optimal subsets in various contexts. Some examples include data summarization for efficient human consumption, finding effective smaller subsets of training data to reduce the model development time (training, hyper parameter tuning), finding effective subsets of unlabeled data to reduce the labeling costs, etc. A recent work has also leveraged submodular functions to propose submodular information measures which have been found to be very useful in solving the problems of guided subset selection and guided summarization. In this work, we present Submodlib which is an open-source, easy-to-use, efficient and scalable Python library for submodular optimization with a C++ optimization engine. Submodlib finds its application in summarization, data subset selection, hyper parameter tuning, efficient training and more. Through a rich API, it offers a great deal of flexibility in the way it can be used.
    Quantum Differential Privacy: An Information Theory Perspective. (arXiv:2202.10717v1 [quant-ph])
    Differential privacy has been an exceptionally successful concept when it comes to providing provable security guarantees for classical computations. More recently, the concept was generalized to quantum computations. While classical computations are essentially noiseless and differential privacy is often achieved by artificially adding noise, near-term quantum computers are inherently noisy and it was observed that this leads to natural differential privacy as a feature. In this work we discuss quantum differential privacy in an information theoretic framework by casting it as a quantum divergence. A main advantage of this approach is that differential privacy becomes a property solely based on the output states of the computation, without the need to check it for every measurement. This leads to simpler proofs and generalized statements of its properties as well as several new bounds for both, general and specific, noise models. In particular, these include common representations of quantum circuits and quantum machine learning concepts. Here, we focus on the difference in the amount of noise required to achieve certain levels of differential privacy versus the amount that would make any computation useless. Finally, we also generalize the classical concepts of local differential privacy, R\'enyi differential privacy and the hypothesis testing interpretation to the quantum setting, providing several new properties and insights.
    Classical versus Quantum: comparing Tensor Network-based Quantum Circuits on LHC data. (arXiv:2202.10471v1 [quant-ph])
    Tensor Networks (TN) are approximations of high-dimensional tensors designed to represent locally entangled quantum many-body systems efficiently. This study provides a comprehensive comparison between classical TNs and TN-inspired quantum circuits in the context of Machine Learning on highly complex, simulated LHC data. We show that classical TNs require exponentially large bond dimensions and higher Hilbert-space mapping to perform comparably to their quantum counterparts. While such an expansion in the dimensionality allows better performance, we observe that, with increased dimensionality, classical TNs lead to a highly flat loss landscape, rendering the usage of gradient-based optimization methods highly challenging. Furthermore, by employing quantitative metrics, such as the Fisher information and effective dimensions, we show that classical TNs require a more extensive training sample to represent the data as efficiently as TN-inspired quantum circuits. We also engage with the idea of hybrid classical-quantum TNs and show possible architectures to employ a larger phase-space from the data. We offer our results using three main TN ansatz: Tree Tensor Networks, Matrix Product States, and Multi-scale Entanglement Renormalisation Ansatz.
    Hierarchical Deep Generative Models for Design Under Free-Form Geometric Uncertainty. (arXiv:2202.10558v1 [cs.CE])
    Deep generative models have demonstrated effectiveness in learning compact and expressive design representations that significantly improve geometric design optimization. However, these models do not consider the uncertainty introduced by manufacturing or fabrication. Past work that quantifies such uncertainty often makes simplifying assumptions on geometric variations, while the "real-world", "free-form" uncertainty and its impact on design performance are difficult to quantify due to the high dimensionality. To address this issue, we propose a Generative Adversarial Network-based Design under Uncertainty Framework (GAN-DUF), which contains a deep generative model that simultaneously learns a compact representation of nominal (ideal) designs and the conditional distribution of fabricated designs given any nominal design. This opens up new possibilities of 1)~building a universal uncertainty quantification model compatible with both shape and topological designs, 2)~modeling free-form geometric uncertainties without the need to make any assumptions on the distribution of geometric variability, and 3)~allowing fast prediction of uncertainties for new nominal designs. We can combine the proposed deep generative model with robust design optimization or reliability-based design optimization for design under uncertainty. We demonstrated the framework on two real-world engineering design examples and showed its capability of finding the solution that possesses better performances after fabrication.
    A Globally Convergent Evolutionary Strategy for Stochastic Constrained Optimization with Applications to Reinforcement Learning. (arXiv:2202.10464v1 [cs.NE])
    Evolutionary strategies have recently been shown to achieve competing levels of performance for complex optimization problems in reinforcement learning. In such problems, one often needs to optimize an objective function subject to a set of constraints, including for instance constraints on the entropy of a policy or to restrict the possible set of actions or states accessible to an agent. Convergence guarantees for evolutionary strategies to optimize stochastic constrained problems are however lacking in the literature. In this work, we address this problem by designing a novel optimization algorithm with a sufficient decrease mechanism that ensures convergence and that is based only on estimates of the functions. We demonstrate the applicability of this algorithm on two types of experiments: i) a control task for maximizing rewards and ii) maximizing rewards subject to a non-relaxable set of constraints.
    EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data Reshaping for Online Adaptation or Personalization. (arXiv:2202.10935v1 [cs.LG])
    Conventionally, DNN models are trained once in the cloud and deployed in edge devices such as cars, robots, or unmanned aerial vehicles (UAVs) for real-time inference. However, there are many cases that require the models to adapt to new environments, domains, or new users. In order to realize such domain adaption or personalization, the models on devices need to be continuously trained on the device. In this work, we design EF-Train, an efficient DNN training accelerator with a unified channel-level parallelism-based convolution kernel that can achieve end-to-end training on resource-limited low-power edge-level FPGAs. It is challenging to implement on-device training on resource-limited FPGAs due to the low efficiency caused by different memory access patterns among forward, backward propagation, and weight update. Therefore, we developed a data reshaping approach with intra-tile continuous memory allocation and weight reuse. An analytical model is established to automatically schedule computation and memory resources to achieve high energy efficiency on edge FPGAs. The experimental results show that our design achieves 46.99 GFLOPS and 6.09GFLOPS/W in terms of throughput and energy efficiency, respectively.
    Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks. (arXiv:2202.10571v1 [cs.CV])
    In the deep learning era, long video generation of high-quality still remains challenging due to the spatio-temporal complexity and continuity of videos. Existing prior works have attempted to model video distribution by representing videos as 3D grids of RGB values, which impedes the scale of generated videos and neglects continuous dynamics. In this paper, we found that the recent emerging paradigm of implicit neural representations (INRs) that encodes a continuous signal into a parameterized neural network effectively mitigates the issue. By utilizing INRs of video, we propose dynamics-aware implicit generative adversarial network (DIGAN), a novel generative adversarial network for video generation. Specifically, we introduce (a) an INR-based video generator that improves the motion dynamics by manipulating the space and time coordinates differently and (b) a motion discriminator that efficiently identifies the unnatural motions without observing the entire long frame sequences. We demonstrate the superiority of DIGAN under various datasets, along with multiple intriguing properties, e.g., long video synthesis, video extrapolation, and non-autoregressive video generation. For example, DIGAN improves the previous state-of-the-art FVD score on UCF-101 by 30.7% and can be trained on 128 frame videos of 128x128 resolution, 80 frames longer than the 48 frames of the previous state-of-the-art method.
    Inequality Constrained Stochastic Nonlinear Optimization via Active-Set Sequential Quadratic Programming. (arXiv:2109.11502v2 [math.OC] UPDATED)
    We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming algorithm that uses a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian, and performs stochastic line search to decide the stepsize. The global convergence is established: for any initialization, the "liminf" of the KKT residuals converges to zero almost surely. Our algorithm and analysis further develop the work of Na et al. (2021) by allowing nonlinear inequality constraints without requiring the strict complementary condition. We demonstrate the performance of the algorithm on a subset of nonlinear problems collected in CUTEst test set.
    A Novel Convergence Analysis for Algorithms of the Adam Family and Beyond. (arXiv:2104.14840v4 [math.OC] UPDATED)
    Why does the original analysis of Adam fail, but it still converges very well in practice on a broad range of problems? There are still some mysteries about Adam that have not been unraveled. This paper provides a novel non-convex analysis of Adam and its many variants to uncover some of these mysteries. Our analysis exhibits that an increasing or large enough "momentum" parameter for the first-order moment used in practice is sufficient to ensure Adam and its many variants converge under a mild boundness condition on the adaptive scaling factor of the step size. In contrast, the original problematic analysis of Adam uses a momentum parameter that decreases to zero, which is the key reason that makes it diverge on some problems. To the best of our knowledge, this is the first time the gap between analysis and practice is bridged. Our analysis also exhibits more insights for practical implementations of Adam, e.g., increasing the momentum parameter in a stage-wise manner in accordance with stagewise decreasing step size would help improve the convergence. Our analysis of the Adam family is modular such that it can be (has been) extended to solving other optimization problems, e.g., compositional, min-max and bi-level problems. As an interesting yet non-trivial use case, we present an extension for solving non-convex min-max optimization in order to address a gap in the literature that either requires a large batch or has double loops. Our empirical studies corroborate the theory and also demonstrate the effectiveness in solving min-max problems.
    TENT: Tensorized Encoder Transformer for Temperature Forecasting. (arXiv:2106.14742v2 [cs.LG] UPDATED)
    Reliable weather forecasting is of great importance in science, business, and society. The best performing data-driven models for weather prediction tasks rely on recurrent or convolutional neural networks, where some of which incorporate attention mechanisms. In this work, we introduce a novel model based on Transformer architecture for weather forecasting. The proposed Tensorial Encoder Transformer (TENT) model is equipped with tensorial attention and thus it exploits the spatiotemporal structure of weather data by processing it in multidimensional tensorial format. We show that compared to the classical encoder transformer, 3D convolutional neural networks, LSTM, and Convolutional LSTM, the proposed TENT model can better learn the underlying complex pattern of the weather data for the studied temperature prediction task. Experiments on two real-life weather datasets are performed. The datasets consist of historical measurements from weather stations in the USA, Canada and Europe. The first dataset contains hourly measurements of weather attributes for 30 cities in the USA and Canada from October 2012 to November 2017. The second dataset contains daily measurements of weather attributes of 18 cities across Europe from May 2005 to April 2020. Two attention scores are introduced based on the obtained tonsorial attention and are visualized in order to shed light on the decision-making process of our model and provide insight knowledge on the most important cities for the target cities.
    CoBERL: Contrastive BERT for Reinforcement Learning. (arXiv:2107.05431v2 [cs.LG] UPDATED)
    Many reinforcement learning (RL) agents require a large amount of experience to solve tasks. We propose Contrastive BERT for RL (CoBERL), an agent that combines a new contrastive loss and a hybrid LSTM-transformer architecture to tackle the challenge of improving data efficiency. CoBERL enables efficient, robust learning from pixels across a wide range of domains. We use bidirectional masked prediction in combination with a generalization of recent contrastive methods to learn better representations for transformers in RL, without the need of hand engineered data augmentations. We find that CoBERL consistently improves performance across the full Atari suite, a set of control tasks and a challenging 3D environment.
    Local SGD Optimizes Overparameterized Neural Networks in Polynomial Time. (arXiv:2107.10868v2 [cs.LG] UPDATED)
    In this paper we prove that Local (S)GD (or FedAvg) can optimize deep neural networks with Rectified Linear Unit (ReLU) activation function in polynomial time. Despite the established convergence theory of Local SGD on optimizing general smooth functions in communication-efficient distributed optimization, its convergence on non-smooth ReLU networks still eludes full theoretical understanding. The key property used in many Local SGD analysis on smooth function is gradient Lipschitzness, so that the gradient on local models will not drift far away from that on averaged model. However, this decent property does not hold in networks with non-smooth ReLU activation function. We show that, even though ReLU network does not admit gradient Lipschitzness property, the difference between gradients on local models and average model will not change too much, under the dynamics of Local SGD. We validate our theoretical results via extensive experiments. This work is the first to show the convergence of Local SGD on non-smooth functions, and will shed lights on the optimization theory of federated training of deep neural networks.
    Adversarial Tracking Control via Strongly Adaptive Online Learning with Memory. (arXiv:2102.01623v3 [cs.LG] UPDATED)
    We consider the problem of tracking an adversarial state sequence in a linear dynamical system subject to adversarial disturbances and loss functions, generalizing earlier settings in the literature. To this end, we develop three techniques, each of independent interest. First, we propose a comparator-adaptive algorithm for online linear optimization with movement cost. Without tuning, it nearly matches the performance of the optimally tuned gradient descent in hindsight. Next, considering a related problem called online learning with memory, we construct a novel strongly adaptive algorithm that uses our first contribution as a building block. Finally, we present the first reduction from adversarial tracking control to strongly adaptive online learning with memory. Summarizing these individual techniques, we obtain an adversarial tracking controller with a strong performance guarantee even when the reference trajectory has a large range of movement.
    Continuous Treatment Recommendation with Deep Survival Dose Response Function. (arXiv:2108.10453v2 [stat.ML] UPDATED)
    We propose a general formulation for continuous treatment recommendation problems in settings with clinical survival data, which we call the Deep Survival Dose Response Function (DeepSDRF). That is, we consider the problem of learning the conditional average dose response (CADR) function solely from historical data in which unobserved factors (confounders) affect both observed treatment and time-to-event outcomes. The estimated treatment effect from DeepSDRF enables us to develop recommender algorithms with explanatory insights. We compared two recommender approaches based on random search and reinforcement learning and found similar performance in terms of patient outcome. We tested the DeepSDRF and the corresponding recommender on extensive simulation studies and two empirical databases: 1) the Clinical Practice Research Datalink (CPRD) and 2) the eICU Research Institute (eRI) database. To the best of our knowledge, this is the first time that confounders are taken into consideration for addressing the continuous treatment effect with observational data in a medical context.
    Strong replica symmetry for high-dimensional disordered log-concave Gibbs measures. (arXiv:2009.12939v3 [math.PR] UPDATED)
    We consider a generic class of log-concave, possibly random, (Gibbs) measures. We prove the concentration of an infinite family of order parameters called multioverlaps. Because they completely parametrise the quenched Gibbs measure of the system, this implies a simple representation of the asymptotic Gibbs measures, as well as the decoupling of the variables in a strong sense. These results may prove themselves useful in several contexts. In particular in machine learning and high-dimensional inference, log-concave measures appear in convex empirical risk minimisation, maximum a-posteriori inference or M-estimation. We believe that they may be applicable in establishing some type of "replica symmetric formulas" for the free energy, inference or generalisation error in such settings.
    Computing Multiple Image Reconstructions with a Single Hypernetwork. (arXiv:2202.11009v1 [cs.CV])
    Deep learning based techniques achieve state-of-the-art results in a wide range of image reconstruction tasks like compressed sensing. These methods almost always have hyperparameters, such as the weight coefficients that balance the different terms in the optimized loss function. The typical approach is to train the model for a hyperparameter setting determined with some empirical or theoretical justification. Thus, at inference time, the model can only compute reconstructions corresponding to the pre-determined hyperparameter values. In this work, we present a hypernetwork based approach, called HyperRecon, to train reconstruction models that are agnostic to hyperparameter settings. At inference time, HyperRecon can efficiently produce diverse reconstructions, which would each correspond to different hyperparameter values. In this framework, the user is empowered to select the most useful output(s) based on their own judgement. We demonstrate our method in compressed sensing, super-resolution and denoising tasks, using two large-scale and publicly-available MRI datasets. Our code is available at https://github.com/alanqrwang/hyperrecon.
    Counterfactual Phenotyping with Censored Time-to-Events. (arXiv:2202.11089v1 [cs.LG])
    Estimation of treatment efficacy of real-world clinical interventions involves working with continuous outcomes such as time-to-death, re-hospitalization, or a composite event that may be subject to censoring. Causal reasoning in such scenarios requires decoupling the effects of confounding physiological characteristics that affect baseline survival rates from the effects of the interventions being assessed. In this paper, we present a latent variable approach to model heterogeneous treatment effects by proposing that an individual can belong to one of latent clusters with distinct response characteristics. We show that this latent structure can mediate the base survival rates and helps determine the effects of an intervention. We demonstrate the ability of our approach to discover actionable phenotypes of individuals based on their treatment response on multiple large randomized clinical trials originally conducted to assess appropriate treatments to reduce cardiovascular risk.
    ReorientBot: Learning Object Reorientation for Specific-Posed Placement. (arXiv:2202.11092v1 [cs.RO])
    Robots need the capability of placing objects in arbitrary, specific poses to rearrange the world and achieve various valuable tasks. Object reorientation plays a crucial role in this as objects may not initially be oriented such that the robot can grasp and then immediately place them in a specific goal pose. In this work, we present a vision-based manipulation system, ReorientBot, which consists of 1) visual scene understanding with pose estimation and volumetric reconstruction using an onboard RGB-D camera; 2) learned waypoint selection for successful and efficient motion generation for reorientation; 3) traditional motion planning to generate a collision-free trajectory from the selected waypoints. We evaluate our method using the YCB objects in both simulation and the real world, achieving 93% overall success, 81% improvement in success rate, and 22% improvement in execution time compared to a heuristic approach. We demonstrate extended multi-object rearrangement showing the general capability of the system.
    MSTGD:A Memory Stochastic sTratified Gradient Descent Method with an Exponential Convergence Rate. (arXiv:2202.10923v1 [stat.ML])
    The fluctuation effect of gradient expectation and variance caused by parameter update between consecutive iterations is neglected or confusing by current mainstream gradient optimization algorithms.Using this fluctuation effect, combined with the stratified sampling strategy, this paper designs a novel \underline{M}emory \underline{S}tochastic s\underline{T}ratified Gradient Descend(\underline{MST}GD) algorithm with an exponential convergence rate. Specifically, MSTGD uses two strategies for variance reduction: the first strategy is to perform variance reduction according to the proportion p of used historical gradient, which is estimated from the mean and variance of sample gradients before and after iteration, and the other strategy is stratified sampling by category. The statistic \ $\bar{G}_{mst}$\ designed under these two strategies can be adaptively unbiased, and its variance decays at a geometric rate. This enables MSTGD based on $\bar{G}_{mst}$ to obtain an exponential convergence rate of the form $\lambda^{2(k-k_0)}$($\lambda\in (0,1)$,k is the number of iteration steps,$\lambda$ is a variable related to proportion p).Unlike most other algorithms that claim to achieve an exponential convergence rate, the convergence rate is independent of parameters such as dataset size N, batch size n, etc., and can be achieved at a constant step size.Theoretical and experimental results show the effectiveness of MSTGD
    Letters of the Alphabet: Discovering Natural Feature Sets. (arXiv:2202.10934v1 [cs.LG])
    Deep learning networks find intricate features in large datasets using the backpropagation algorithm. This algorithm repeatedly adjusts the network connections.' weights and examining the "hidden" nodes behavior between the input and output layer provides better insight into how neural networks create feature representations. Experiments built on each other show that activity differences computed within a layer can guide learning. A simple neural network is used, which includes a data set comprised of the alphabet letters, where each letter forms 81 input nodes comprised of 0 and 1s and a single hidden layer and an output layer. The first experiment explains how the hidden layers in this simple neural network represent the input data's features. The second experiment attempts to reverse-engineer the neural network to find the alphabet's natural feature sets. As the network interprets features, we can understand how it derives the natural feature sets for a given data. This understanding is essential to delve deeper into deep generative models, such as Boltzmann machines. Deep generative models are a class of unsupervised deep learning algorithms. The primary function of deep generative models is to find the natural feature sets for a given data set.
    Convex Loss Functions for Contextual Pricing with Observational Posted-Price Data. (arXiv:2202.10944v1 [cs.LG])
    We study an off-policy contextual pricing problem where the seller has access to samples of prices which customers were previously offered, whether they purchased at that price, and auxiliary features describing the customer and/or item being sold. This is in contrast to the well-studied setting in which samples of the customer's valuation (willingness to pay) are observed. In our setting, the observed data is influenced by the historic pricing policy, and we do not know how customers would have responded to alternative prices. We introduce suitable loss functions for this pricing setting which can be directly optimized to find an effective pricing policy with expected revenue guarantees without the need for estimation of an intermediate demand function. We focus on convex loss functions. This is particularly relevant when linear pricing policies are desired for interpretability reasons, resulting in a tractable convex revenue optimization problem. We further propose generalized hinge and quantile pricing loss functions, which price at a multiplicative factor of the conditional expected value or a particular quantile of the valuation distribution when optimized, despite the valuation data not being observed. We prove expected revenue bounds for these pricing policies respectively when the valuation distribution is log-concave, and provide generalization bounds for the finite sample case. Finally, we conduct simulations on both synthetic and real-world data to demonstrate that this approach is competitive with, and in some settings outperforms, state-of-the-art methods in contextual pricing.
    A framework for spatial heat risk assessment using a generalized similarity measure. (arXiv:2202.10963v1 [cs.LG])
    In this study, we develop a novel framework to assess health risks due to heat hazards across various localities (zip codes) across the state of Maryland with the help of two commonly used indicators i.e. exposure and vulnerability. Our approach quantifies each of the two aforementioned indicators by developing their corresponding feature vectors and subsequently computes indicator-specific reference vectors that signify a high risk environment by clustering the data points at the tail-end of an empirical risk spectrum. The proposed framework circumvents the information-theoretic entropy based aggregation methods whose usage varies with different views of entropy that are subjective in nature and more importantly generalizes the notion of risk-valuation using cosine similarity with unknown reference points.
    StickyLand: Breaking the Linear Presentation of Computational Notebooks. (arXiv:2202.11086v1 [cs.HC])
    How can we better organize code in computational notebooks? Notebooks have become a popular tool among data scientists, as they seamlessly weave text and code together, supporting users to rapidly iterate and document code experiments. However, it is often challenging to organize code in notebooks, partially because there is a mismatch between the linear presentation of code and the non-linear process of exploratory data analysis. We present StickyLand, a notebook extension for empowering users to freely organize their code in non-linear ways. With sticky cells that are always shown on the screen, users can quickly access their notes, instantly observe experiment results, and easily build interactive dashboards that support complex visual analytics. Case studies highlight how our tool can enhance notebook users's productivity and identify opportunities for future notebook designs. StickyLand is available at https://github.com/xiaohk/stickyland.
    Temporal Subtyping of Alzheimer's Disease Using Medical Conditions Preceding Alzheimer's Disease Onset in Electronic Health Records. (arXiv:2202.10991v1 [cs.LG])
    Subtyping of Alzheimer's disease (AD) can facilitate diagnosis, treatment, prognosis and disease management. It can also support the testing of new prevention and treatment strategies through clinical trials. In this study, we employed spectral clustering to cluster 29,922 AD patients in the OneFlorida Data Trust using their longitudinal EHR data of diagnosis and conditions into four subtypes. These subtypes exhibit different patterns of progression of other conditions prior to the first AD diagnosis. In addition, according to the results of various statistical tests, these subtypes are also significantly different with respect to demographics, mortality, and prescription medications after the AD diagnosis. This study could potentially facilitate early detection and personalized treatment of AD as well as data-driven generalizability assessment of clinical trials for AD.
    POLE: Polarized Embedding for Signed Networks. (arXiv:2110.09899v3 [cs.SI] UPDATED)
    From the 2016 U.S. presidential election to the 2021 Capitol riots to the spread of misinformation related to COVID-19, many have blamed social media for today's deeply divided society. Recent advances in machine learning for signed networks hold the promise to guide small interventions with the goal of reducing polarization in social media. However, existing models are especially ineffective in predicting conflicts (or negative links) among users. This is due to a strong correlation between link signs and the network structure, where negative links between polarized communities are too sparse to be predicted even by state-of-the-art approaches. To address this problem, we first design a partition-agnostic polarization measure for signed graphs based on the signed random-walk and show that many real-world graphs are highly polarized. Then, we propose POLE (POLarized Embedding for signed networks), a signed embedding method for polarized graphs that captures both topological and signed similarities jointly via signed autocovariance. Through extensive experiments, we show that POLE significantly outperforms state-of-the-art methods in signed link prediction, particularly for negative links with gains of up to one order of magnitude.
    Probing GNN Explainers: A Rigorous Theoretical and Empirical Analysis of GNN Explanation Methods. (arXiv:2106.09078v2 [cs.LG] UPDATED)
    As Graph Neural Networks (GNNs) are increasingly being employed in critical real-world applications, several methods have been proposed in recent literature to explain the predictions of these models. However, there has been little to no work on systematically analyzing the reliability of these methods. Here, we introduce the first-ever theoretical analysis of the reliability of state-of-the-art GNN explanation methods. More specifically, we theoretically analyze the behavior of various state-of-the-art GNN explanation methods with respect to several desirable properties (e.g., faithfulness, stability, and fairness preservation) and establish upper bounds on the violation of these properties. We also empirically validate our theoretical results using extensive experimentation with nine real-world graph datasets. Our empirical results further shed light on several interesting insights about the behavior of state-of-the-art GNN explanation methods.
    Harmless interpolation in regression and classification with structured features. (arXiv:2111.05198v2 [stat.ML] UPDATED)
    Overparametrized neural networks tend to perfectly fit noisy training data yet generalize well on test data. Inspired by this empirical observation, recent work has sought to understand this phenomenon of benign overfitting or harmless interpolation in the much simpler linear model. Previous theoretical work critically assumes that either the data features are statistically independent or the input data is high-dimensional; this precludes general nonparametric settings with structured feature maps. In this paper, we present a general and flexible framework for upper bounding regression and classification risk in a reproducing kernel Hilbert space. A key contribution is that our framework describes precise sufficient conditions on the data Gram matrix under which harmless interpolation occurs. Our results recover prior independent-features results (with a much simpler analysis), but they furthermore show that harmless interpolation can occur in more general settings such as features that are a bounded orthonormal system. Furthermore, our results show an asymptotic separation between classification and regression performance in a manner that was previously only shown for Gaussian features.
    Efficient Wind Speed Nowcasting with GPU-Accelerated Nearest Neighbors Algorithm. (arXiv:2112.10408v2 [cs.LG] UPDATED)
    This paper proposes a simple yet efficient high-altitude wind nowcasting pipeline. It processes efficiently a vast amount of live data recorded by airplanes over the whole airspace and reconstructs the wind field with good accuracy. It creates a unique context for each point in the dataset and then extrapolates from it. As creating such context is computationally intensive, this paper proposes a novel algorithm that reduces the time and memory cost by efficiently fetching nearest neighbors in a data set whose elements are organized along smooth trajectories that can be approximated with piece-wise linear structures. We introduce an efficient and exact strategy implemented through algebraic tensorial operations, which is well-suited to modern GPU-based computing infrastructure. This method employs a scalable Euclidean metric and allows masking data points along one dimension. When applied, this method is more efficient than plain Euclidean k-NN and other well-known data selection methods such as KDTrees and provides a several-fold speedup. We provide an implementation in PyTorch and a novel data set to allow the replication of empirical results.
    Recognizing Concepts and Recognizing Musical Themes. A Quantum Semantic Analysis. (arXiv:2202.10941v1 [cs.LG])
    How are abstract concepts and musical themes recognized on the basis of some previous experience? It is interesting to compare the different behaviors of human and of artificial intelligences with respect to this problem. Generally, a human mind that abstracts a concept (say, table) from a given set of known examples creates a table-Gestalt: a kind of vague and out of focus image that does not fully correspond to a particular table with well determined features. A similar situation arises in the case of musical themes. Can the construction of a gestaltic pattern, which is so natural for human minds, be taught to an intelligent machine? This problem can be successfully discussed in the framework of a quantum approach to pattern recognition and to machine learning. The basic idea is replacing classical data sets with quantum data sets, where either objects or musical themes can be formally represented as pieces of quantum information, involving the uncertainties and the ambiguities that characterize the quantum world. In this framework, the intuitive concept of Gestalt can be simulated by the mathematical concept of positive centroid of a given quantum data set. Accordingly, the crucial problem "how can we classify a new object or a new musical theme (we have listened to) on the basis of a previous experience?" can be dealt with in terms of some special quantum similarity-relations. Although recognition procedures are different for human and for artificial intelligences, there is a common method of "facing the problems" that seems to work in both cases.
    Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future. (arXiv:2106.04420v2 [cs.LG] UPDATED)
    In real-time forecasting in public health, data collection is a non-trivial and demanding task. Often after initially released, it undergoes several revisions later (maybe due to human or technical constraints) - as a result, it may take weeks until the data reaches to a stable value. This so-called 'backfill' phenomenon and its effect on model performance has been barely studied in the prior literature. In this paper, we introduce the multi-variate backfill problem using COVID-19 as the motivating example. We construct a detailed dataset composed of relevant signals over the past year of the pandemic. We then systematically characterize several patterns in backfill dynamics and leverage our observations for formulating a novel problem and neural framework Back2Future that aims to refines a given model's predictions in real-time. Our extensive experiments demonstrate that our method refines the performance of top models for COVID-19 forecasting, in contrast to non-trivial baselines, yielding 18% improvement over baselines, enabling us obtain a new SOTA performance. In addition, we show that our model improves model evaluation too; hence policy-makers can better understand the true accuracy of forecasting models in real-time.
    Learning Competitive Equilibria in Exchange Economies with Bandit Feedback. (arXiv:2106.06616v2 [cs.LG] UPDATED)
    The sharing of scarce resources among multiple rational agents is one of the classical problems in economics. In exchange economies, which are used to model such situations, agents begin with an initial endowment of resources and exchange them in a way that is mutually beneficial until they reach a competitive equilibrium (CE). The allocations at a CE are Pareto efficient and fair. Consequently, they are used widely in designing mechanisms for fair division. However, computing CEs requires the knowledge of agent preferences which are unknown in several applications of interest. In this work, we explore a new online learning mechanism, which, on each round, allocates resources to the agents and collects stochastic feedback on their experience in using that allocation. Its goal is to learn the agent utilities via this feedback and imitate the allocations at a CE in the long run. We quantify CE behavior via two losses and propose a randomized algorithm which achieves sublinear loss under a parametric class of utilities. Empirically, we demonstrate the effectiveness of this mechanism through numerical simulations.
    Better Private Algorithms for Correlation Clustering. (arXiv:2202.10747v1 [cs.LG])
    In machine learning, correlation clustering is an important problem whose goal is to partition the individuals into groups that correlate with their pairwise similarities as much as possible. In this work, we revisit the correlation clustering under the differential privacy constraints. Particularly, we improve previous results and achieve an $\Tilde{O}(n^{1.5})$ additive error compared to the optimal cost in expectation on general graphs. As for unweighted complete graphs, we improve the results further and propose a more involved algorithm which achieves $\Tilde{O}(n \sqrt{\Delta^*})$ additive error, where $\Delta^*$ is the maximum degrees of positive edges among all nodes.
    Why Fair Labels Can Yield Unfair Predictions: Graphical Conditions for Introduced Unfairness. (arXiv:2202.10816v1 [cs.LG])
    In addition to reproducing discriminatory relationships in the training data, machine learning systems can also introduce or amplify discriminatory effects. We refer to this as introduced unfairness, and investigate the conditions under which it may arise. To this end, we propose introduced total variation as a measure of introduced unfairness, and establish graphical conditions under which it may be incentivised to occur. These criteria imply that adding the sensitive attribute as a feature removes the incentive for introduced variation under well-behaved loss functions. Additionally, taking a causal perspective, introduced path-specific effects shed light on the issue of when specific paths should be considered fair.
    Neural Ensemble Search for Uncertainty Estimation and Dataset Shift. (arXiv:2006.08573v3 [cs.LG] UPDATED)
    Ensembles of neural networks achieve superior performance compared to stand-alone networks in terms of accuracy, uncertainty calibration and robustness to dataset shift. \emph{Deep ensembles}, a state-of-the-art method for uncertainty estimation, only ensemble random initializations of a \emph{fixed} architecture. Instead, we propose two methods for automatically constructing ensembles with \emph{varying} architectures, which implicitly trade-off individual architectures' strengths against the ensemble's diversity and exploit architectural variation as a source of diversity. On a variety of classification tasks and modern architecture search spaces, we show that the resulting ensembles outperform deep ensembles not only in terms of accuracy but also uncertainty calibration and robustness to dataset shift. Our further analysis and ablation studies provide evidence of higher ensemble diversity due to architectural variation, resulting in ensembles that can outperform deep ensembles, even when having weaker average base learners. To foster reproducibility, our code is available: \url{https://github.com/automl/nes}
    Efficient and Differentiable Conformal Prediction with General Function Classes. (arXiv:2202.11091v1 [cs.LG])
    Quantifying the data uncertainty in learning tasks is often done by learning a prediction interval or prediction set of the label given the input. Two commonly desired properties for learned prediction sets are \emph{valid coverage} and \emph{good efficiency} (such as low length or low cardinality). Conformal prediction is a powerful technique for learning prediction sets with valid coverage, yet by default its conformalization step only learns a single parameter, and does not optimize the efficiency over more expressive function classes. In this paper, we propose a generalization of conformal prediction to multiple learnable parameters, by considering the constrained empirical risk minimization (ERM) problem of finding the most efficient prediction set subject to valid empirical coverage. This meta-algorithm generalizes existing conformal prediction algorithms, and we show that it achieves approximate valid population coverage and near-optimal efficiency within class, whenever the function class in the conformalization step is low-capacity in a certain sense. Next, this ERM problem is challenging to optimize as it involves a non-differentiable coverage constraint. We develop a gradient-based algorithm for it by approximating the original constrained ERM using differentiable surrogate losses and Lagrangians. Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly over existing approaches in several applications such as prediction intervals with improved length, minimum-volume prediction sets for multi-output regression, and label prediction sets for image classification.
    No-Regret Learning in Partially-Informed Auctions. (arXiv:2202.10606v1 [cs.LG])
    Auctions with partially-revealed information about items are broadly employed in real-world applications, but the underlying mechanisms have limited theoretical support. In this work, we study a machine learning formulation of these types of mechanisms, presenting algorithms that are no-regret from the buyer's perspective. Specifically, a buyer who wishes to maximize his utility interacts repeatedly with a platform over a series of $T$ rounds. In each round, a new item is drawn from an unknown distribution and the platform publishes a price together with incomplete, "masked" information about the item. The buyer then decides whether to purchase the item. We formalize this problem as an online learning task where the goal is to have low regret with respect to a myopic oracle that has perfect knowledge of the distribution over items and the seller's masking function. When the distribution over items is known to the buyer and the mask is a SimHash function mapping $\mathbb{R}^d$ to $\{0,1\}^{\ell}$, our algorithm has regret $\tilde {\mathcal{O}}((Td\ell)^{\frac{1}{2}})$. In a fully agnostic setting when the mask is an arbitrary function mapping to a set of size $n$, our algorithm has regret $\tilde {\mathcal{O}}(T^{\frac{3}{4}}n^{\frac{1}{2}})$. Finally, when the prices are stochastic, the algorithm has regret $\tilde {\mathcal{O}}((Tn)^{\frac{1}{2}})$.
    Gradient Based Activations for Accurate Bias-Free Learning. (arXiv:2202.10943v1 [cs.LG])
    Bias mitigation in machine learning models is imperative, yet challenging. While several approaches have been proposed, one view towards mitigating bias is through adversarial learning. A discriminator is used to identify the bias attributes such as gender, age or race in question. This discriminator is used adversarially to ensure that it cannot distinguish the bias attributes. The main drawback in such a model is that it directly introduces a trade-off with accuracy as the features that the discriminator deems to be sensitive for discrimination of bias could be correlated with classification. In this work we solve the problem. We show that a biased discriminator can actually be used to improve this bias-accuracy tradeoff. Specifically, this is achieved by using a feature masking approach using the discriminator's gradients. We ensure that the features favoured for the bias discrimination are de-emphasized and the unbiased features are enhanced during classification. We show that this simple approach works well to reduce bias as well as improve accuracy significantly. We evaluate the proposed model on standard benchmarks. We improve the accuracy of the adversarial methods while maintaining or even improving the unbiasness and also outperform several other recent methods.
    SDFDiff: Differentiable Rendering of Signed Distance Fields for 3D Shape Optimization. (arXiv:1912.07109v2 [cs.CV] UPDATED)
    We propose SDFDiff, a novel approach for image-based shape optimization using differentiable rendering of 3D shapes represented by signed distance functions (SDFs). Compared to other representations, SDFs have the advantage that they can represent shapes with arbitrary topology, and that they guarantee watertight surfaces. We apply our approach to the problem of multi-view 3D reconstruction, where we achieve high reconstruction quality and can capture complex topology of 3D objects. In addition, we employ a multi-resolution strategy to obtain a robust optimization algorithm. We further demonstrate that our SDF-based differentiable renderer can be integrated with deep learning models, which opens up options for learning approaches on 3D objects without 3D supervision. In particular, we apply our method to single-view 3D reconstruction and achieve state-of-the-art results.
    Permutation-equivariant and Proximity-aware Graph Neural Networks with Stochastic Message Passing. (arXiv:2009.02562v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) are emerging machine learning models on graphs. Permutation-equivariance and proximity-awareness are two important properties highly desirable for GNNs. Both properties are needed to tackle some challenging graph problems, such as finding communities and leaders. In this paper, we first analytically show that the existing GNNs, mostly based on the message-passing mechanism, cannot simultaneously preserve the two properties. Then, we propose Stochastic Message Passing (SMP) model, a general and simple GNN to maintain both proximity-awareness and permutation-equivariance. In order to preserve node proximities, we augment the existing GNNs with stochastic node representations. We theoretically prove that the mechanism can enable GNNs to preserve node proximities, and at the same time, maintain permutation-equivariance with certain parametrization. We report extensive experimental results on ten datasets and demonstrate the effectiveness and efficiency of SMP for various typical graph mining tasks, including graph reconstruction, node classification, and link prediction.
    A Survey of Vision-Language Pre-Trained Models. (arXiv:2202.10936v1 [cs.CV])
    As Transformer evolved, pre-trained models have advanced at a breakneck pace in recent years. They have dominated the mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to the field of Vision-and-Language (V-L) learning and improve the performance on downstream tasks becomes a focus of multimodal learning. In this paper, we review the recent progress in Vision-Language Pre-Trained Models (VL-PTMs). As the core content, we first briefly introduce several ways to encode raw images and texts to single-modal embeddings before pre-training. Then, we dive into the mainstream architectures of VL-PTMs in modeling the interaction between text and image representations. We further present widely-used pre-training tasks, after which we introduce some common downstream tasks. We finally conclude this paper and present some promising research directions. Our survey aims to provide multimodal researchers a synthesis and pointer to related research.
    Message passing all the way up. (arXiv:2202.11097v1 [cs.LG])
    The message passing framework is the foundation of the immense success enjoyed by graph neural networks (GNNs) in recent years. In spite of its elegance, there exist many problems it provably cannot solve over given input graphs. This has led to a surge of research on going "beyond message passing", building GNNs which do not suffer from those limitations -- a term which has become ubiquitous in regular discourse. However, have those methods truly moved beyond message passing? In this position paper, I argue about the dangers of using this term -- especially when teaching graph representation learning to newcomers. I show that any function of interest we want to compute over graphs can, in all likelihood, be expressed using pairwise message passing -- just over a potentially modified graph, and argue how most practical implementations subtly do this kind of trick anyway. Hoping to initiate a productive discussion, I propose replacing "beyond message passing" with a more tame term, "augmented message passing".
    An end-to-end predict-then-optimize clustering method for intelligent assignment problems in express systems. (arXiv:2202.10937v1 [cs.LG])
    Express systems play important roles in modern major cities. Couriers serving for the express system pick up packages in certain areas of interest (AOI) during a specific time. However, future pick-up requests vary significantly with time. While the assignment results are generally static without changing with time. Using the historical pick-up request number to conduct AOI assignment (or pick-up request assignment) for couriers is thus unreasonable. Moreover, even we can first predict future pick-up requests and then use the prediction results to conduct the assignments, this kind of two-stage method is also impractical and trivial, and exists some drawbacks, such as the best prediction results might not ensure the best clustering results. To solve these problems, we put forward an intelligent end-to-end predict-then-optimize clustering method to simultaneously predict the future pick-up requests of AOIs and assign AOIs to couriers by clustering. At first, we propose a deep learning-based prediction model to predict order numbers on AOIs. Then a differential constrained K-means clustering method is introduced to cluster AOIs based on the prediction results. We finally propose a one-stage end-to-end predict-then-optimize clustering method to assign AOIs to couriers reasonably, dynamically, and intelligently. Results show that this kind of one-stage predict-then-optimize method is beneficial to improve the performance of optimization results, namely the clustering results. This study can provide critical experiences for predict-and-optimize related tasks and intelligent assignment problems in express systems.
    Provably convergent quasistatic dynamics for mean-field two-player zero-sum games. (arXiv:2202.10947v1 [cs.GT])
    In this paper, we study the problem of finding mixed Nash equilibrium for mean-field two-player zero-sum games. Solving this problem requires optimizing over two probability distributions. We consider a quasistatic Wasserstein gradient flow dynamics in which one probability distribution follows the Wasserstein gradient flow, while the other one is always at the equilibrium. Theoretical analysis are conducted on this dynamics, showing its convergence to the mixed Nash equilibrium under mild conditions. Inspired by the continuous dynamics of probability distributions, we derive a quasistatic Langevin gradient descent method with inner-outer iterations, and test the method on different problems, including training mixture of GANs.
    Multi-Source Unsupervised Domain Adaptation via Pseudo Target Domain. (arXiv:2202.10725v1 [cs.LG])
    Multi-source domain adaptation (MDA) aims to transfer knowledge from multiple source domains to an unlabeled target domain. MDA is a challenging task due to the severe domain shift, which not only exists between target and source but also exists among diverse sources. Prior studies on MDA either estimate a mixed distribution of source domains or combine multiple single-source models, but few of them delve into the relevant information among diverse source domains. For this reason, we propose a novel MDA approach, termed Pseudo Target for MDA (PTMDA). Specifically, PTMDA maps each group of source and target domains into a group-specific subspace using adversarial learning with a metric constraint, and constructs a series of pseudo target domains correspondingly. Then we align the remainder source domains with the pseudo target domain in the subspace efficiently, which allows to exploit additional structured source information through the training on pseudo target domain and improves the performance on the real target domain. Besides, to improve the transferability of deep neural networks (DNNs), we replace the traditional batch normalization layer with an effective matching normalization layer, which enforces alignments in latent layers of DNNs and thus gains further promotion. We give theoretical analysis showing that PTMDA as a whole can reduce the target error bound and leads to a better approximation of the target risk in MDA settings. Extensive experiments demonstrate PTMDA's effectiveness on MDA tasks, as it outperforms state-of-the-art methods in most experimental settings.
    Transformation Coding: Simple Objectives for Equivariant Representations. (arXiv:2202.10930v1 [cs.LG])
    We present a simple non-generative approach to deep representation learning that seeks equivariant deep embedding through simple objectives. In contrast to existing equivariant networks, our transformation coding approach does not constrain the choice of the feed-forward layer or the architecture and allows for an unknown group action on the input space. We introduce several such transformation coding objectives for different Lie groups such as the Euclidean, Orthogonal and the Unitary groups. When using product groups, the representation is decomposed and disentangled. We show that the presence of additional information on different transformations improves disentanglement in transformation coding. We evaluate the representations learnt by transformation coding both qualitatively and quantitatively on downstream tasks, including reinforcement learning.
    Thinking the Fusion Strategy of Multi-reference Face Reenactment. (arXiv:2202.10758v1 [cs.CV])
    In recent advances of deep generative models, face reenactment -manipulating and controlling human face, including their head movement-has drawn much attention for its wide range of applicability. Despite its strong expressiveness, it is inevitable that the models fail to reconstruct or accurately generate unseen side of the face of a given single reference image. Most of existing methods alleviate this problem by learning appearances of human faces from large amount of data and generate realistic texture at inference time. Rather than completely relying on what generative models learn, we show that simple extension by using multiple reference images significantly improves generation quality. We show this by 1) conducting the reconstruction task on publicly available dataset, 2) conducting facial motion transfer on our original dataset which consists of multi-person's head movement video sequences, and 3) using a newly proposed evaluation metric to validate that our method achieves better quantitative results.
    Learning Dynamics and Structure of Complex Systems Using Graph Neural Networks. (arXiv:2202.10996v1 [cs.LG])
    Many complex systems are composed of interacting parts, and the underlying laws are usually simple and universal. While graph neural networks provide a useful relational inductive bias for modeling such systems, generalization to new system instances of the same type is less studied. In this work we trained graph neural networks to fit time series from an example nonlinear dynamical system, the belief propagation algorithm. We found simple interpretations of the learned representation and model components, and they are consistent with core properties of the probabilistic inference algorithm. We successfully identified a `graph translator' between the statistical interactions in belief propagation and parameters of the corresponding trained network, and showed that it enables two types of novel generalization: to recover the underlying structure of a new system instance based solely on time series observations, or to construct a new network from this structure directly. Our results demonstrated a path towards understanding both dynamics and structure of a complex system and how such understanding can be used for generalization.
    Convolutional Neural Network Modelling for MODIS Land Surface Temperature Super-Resolution. (arXiv:2202.10753v1 [cs.CV])
    Nowadays, thermal infrared satellite remote sensors enable to extract very interesting information at large scale, in particular Land Surface Temperature (LST). However such data are limited in spatial and/or temporal resolutions which prevents from an analysis at fine scales. For example, MODIS satellite provides daily acquisitions with 1Km spatial resolutions which is not sufficient to deal with highly heterogeneous environments as agricultural parcels. Therefore, image super-resolution is a crucial task to better exploit MODIS LSTs. This issue is tackled in this paper. We introduce a deep learning-based algorithm, named Multi-residual U-Net, for super-resolution of MODIS LST single-images. Our proposed network is a modified version of U-Net architecture, which aims at super-resolving the input LST image from 1Km to 250m per pixel. The results show that our Multi-residual U-Net outperforms other state-of-the-art methods.
    A Deep Learning Approach to Predicting Ventilator Parameters for Mechanically Ventilated Septic Patients. (arXiv:2202.10921v1 [q-bio.QM])
    We develop a deep learning approach to predicting a set of ventilator parameters for a mechanically ventilated septic patient using a long and short term memory (LSTM) recurrent neural network (RNN) model. We focus on short-term predictions of a set of ventilator parameters for the septic patient in emergency intensive care unit (EICU). The short-term predictability of the model provides attending physicians with early warnings to make timely adjustment to the treatment of the patient in the EICU. The patient specific deep learning model can be trained on any given critically ill patient, making it an intelligent aide for physicians to use in emergent medical situations.
    PyTorch Geometric Signed Directed: A Survey and Software on Graph Neural Networks for Signed and Directed Graphs. (arXiv:2202.10793v1 [cs.LG])
    Signed networks are ubiquitous in many real-world applications (e.g., social networks encoding trust/distrust relationships, correlation networks arising from time series data). While many signed networks are directed, there is a lack of survey papers and software packages on graph neural networks (GNNs) specially designed for directed networks. In this paper, we present PyTorch Geometric Signed Directed, a survey and software on GNNs for signed and directed networks. We review typical tasks, loss functions and evaluation metrics in the analysis of signed and directed networks, discuss data used in related experiments, and provide an overview of methods proposed. The deep learning framework consists of easy-to-use GNN models, synthetic and real-world data, as well as task-specific evaluation metrics and loss functions for signed and directed networks. The software is presented in a modular fashion, so that signed and directed networks can also be treated separately. As an extension library for PyTorch Geometric, our proposed software is maintained with open-source releases, detailed documentation, continuous integration, unit tests and code coverage checks. Our code is publicly available at \url{https://github.com/SherylHYX/pytorch_geometric_signed_directed}.
    NU HLT at CMCL 2022 Shared Task: Multilingual and Crosslingual Prediction of Human Reading Behavior in Universal Language Space. (arXiv:2202.10855v1 [cs.CL])
    In this paper, we present a unified model that works for both multilingual and crosslingual prediction of reading times of words in various languages. The secret behind the success of this model is in the preprocessing step where all words are transformed to their universal language representation via the International Phonetic Alphabet (IPA). To the best of our knowledge, this is the first study to favorable exploit this phonological property of language for the two tasks. Various feature types were extracted covering basic frequencies, n-grams, information theoretic, and psycholinguistically-motivated predictors for model training. A finetuned Random Forest model obtained best performance for both tasks with 3.8031 and 3.9065 MAE scores for mean first fixation duration (FFDAve) and mean total reading time (TRTAve) respectively.
    Explicit Regularization via Regularizer Mirror Descent. (arXiv:2202.10788v1 [cs.LG])
    Despite perfectly interpolating the training data, deep neural networks (DNNs) can often generalize fairly well, in part due to the "implicit regularization" induced by the learning algorithm. Nonetheless, various forms of regularization, such as "explicit regularization" (via weight decay), are often used to avoid overfitting, especially when the data is corrupted. There are several challenges with explicit regularization, most notably unclear convergence properties. Inspired by convergence properties of stochastic mirror descent (SMD) algorithms, we propose a new method for training DNNs with regularization, called regularizer mirror descent (RMD). In highly overparameterized DNNs, SMD simultaneously interpolates the training data and minimizes a certain potential function of the weights. RMD starts with a standard cost which is the sum of the training loss and a convex regularizer of the weights. Reinterpreting this cost as the potential of an "augmented" overparameterized network and applying SMD yields RMD. As a result, RMD inherits the properties of SMD and provably converges to a point "close" to the minimizer of this cost. RMD is computationally comparable to stochastic gradient descent (SGD) and weight decay, and is parallelizable in the same manner. Our experimental results on training sets with various levels of corruption suggest that the generalization performance of RMD is remarkably robust and significantly better than both SGD and weight decay, which implicitly and explicitly regularize the $\ell_2$ norm of the weights. RMD can also be used to regularize the weights to a desired weight vector, which is particularly relevant for continual learning.
    Remaining Useful Life Prediction Using Temporal Deep Degradation Network for Complex Machinery with Attention-based Feature Extraction. (arXiv:2202.10916v1 [cs.LG])
    The precise estimate of remaining useful life (RUL) is vital for the prognostic analysis and predictive maintenance that can significantly reduce failure rate and maintenance costs. The degradation-related features extracted from the sensor streaming data with neural networks can dramatically improve the accuracy of the RUL prediction. The Temporal deep degradation network (TDDN) model is proposed to make the RUL prediction with the degradation-related features given by the one-dimensional convolutional neural network (1D CNN) feature extraction and attention mechanism. 1D CNN is used to extract the temporal features from the streaming sensor data. Temporal features have monotonic degradation trends from the fluctuating raw sensor streaming data. Attention mechanism can improve the RUL prediction performance by capturing the fault characteristics and the degradation development with the attention weights. The performance of the TDDN model is evaluated on the public C-MAPSS dataset and compared with the existing methods. The results show that the TDDN model can achieve the best RUL prediction accuracy in complex conditions compared to current machine learning models. The degradation-related features extracted from the high-dimension sensor streaming data demonstrate the clear degradation trajectories and degradation stages that enable TDDN to predict the turbofan-engine RUL accurately and efficiently.
    A Novel Anomaly Detection Method for Multimodal WSN Data Flow via a Dynamic Graph Neural Network. (arXiv:2202.10454v1 [cs.LG])
    Anomaly detection is widely used to distinguish system anomalies by analyzing the temporal and spatial features of wireless sensor network (WSN) data streams; it is one of critical technique that ensures the reliability of WSNs. Currently, graph neural networks (GNNs) have become popular state-of-the-art methods for conducting anomaly detection on WSN data streams. However, the existing anomaly detection methods based on GNNs do not consider the temporal and spatial features of WSN data streams simultaneously, such as multi-node, multi-modal and multi-time features, seriously impacting their effectiveness. In this paper, a novel anomaly detection model is proposed for multimodal WSN data flows, where three GNNs are used to separately extract the temporal features of WSN data flows, the correlation features between different modes and the spatial features between sensor node positions. Specifically, first, the temporal features and modal correlation features extracted from each sensor node are fused into one vector representation, which is further aggregated with the spatial features, i.e., the spatial position relationships of the nodes; finally, the current time-series data of WSN nodes are predicted, and abnormal states are identified according to the fusion features. The simulation results obtained on a public dataset show that the proposed approach is able to significantly improve upon the existing methods in terms of its robustness, and its F1 score reaches 0.90, which is 14.2% higher than that of the graph convolution network (GCN) with long short-term memory (LSTM).
    Improving Systematic Generalization Through Modularity and Augmentation. (arXiv:2202.10745v1 [cs.AI])
    Systematic generalization is the ability to combine known parts into novel meaning; an important aspect of efficient human learning, but a weakness of neural network learning. In this work, we investigate how two well-known modeling principles -- modularity and data augmentation -- affect systematic generalization of neural networks in grounded language learning. We analyze how large the vocabulary needs to be to achieve systematic generalization and how similar the augmented data needs to be to the problem at hand. Our findings show that even in the controlled setting of a synthetic benchmark, achieving systematic generalization remains very difficult. After training on an augmented dataset with almost forty times more adverbs than the original problem, a non-modular baseline is not able to systematically generalize to a novel combination of a known verb and adverb. When separating the task into cognitive processes like perception and navigation, a modular neural network is able to utilize the augmented data and generalize more systematically, achieving 70% and 40% exact match increase over state-of-the-art on two gSCAN tests that have not previously been improved. We hope that this work gives insight into the drivers of systematic generalization, and what we still need to improve for neural networks to learn more like humans do.
    UncertaINR: Uncertainty Quantification of End-to-End Implicit Neural Representations for Computed Tomography. (arXiv:2202.10847v1 [eess.IV])
    Implicit neural representations (INRs) have achieved impressive results for scene reconstruction and computer graphics, where their performance has primarily been assessed on reconstruction accuracy. However, in medical imaging, where the reconstruction problem is underdetermined and model predictions inform high-stakes diagnoses, uncertainty quantification of INR inference is critical. To that end, we study UncertaINR: a Bayesian reformulation of INR-based image reconstruction, for computed tomography (CT). We test several Bayesian deep learning implementations of UncertaINR and find that they achieve well-calibrated uncertainty, while retaining accuracy competitive with other classical, INR-based, and CNN-based reconstruction techniques. In contrast to the best-performing prior approaches, UncertaINR does not require a large training dataset, but only a handful of validation images.
    Stochastic Causal Programming for Bounding Treatment Effects. (arXiv:2202.10806v1 [stat.ML])
    Causal effect estimation is important for numerous tasks in the natural and social sciences. However, identifying effects is impossible from observational data without making strong, often untestable assumptions. We consider algorithms for the partial identification problem, bounding treatment effects from multivariate, continuous treatments over multiple possible causal models when unmeasured confounding makes identification impossible. We consider a framework where observable evidence is matched to the implications of constraints encoded in a causal model by norm-based criteria. This generalizes classical approaches based purely on generative models. Casting causal effects as objective functions in a constrained optimization problem, we combine flexible learning algorithms with Monte Carlo methods to implement a family of solutions under the name of stochastic causal programming. In particular, we present ways by which such constrained optimization problems can be parameterized without likelihood functions for the causal or the observed data model, reducing the computational and statistical complexity of the task.
    Partial Identification with Noisy Covariates: A Robust Optimization Approach. (arXiv:2202.10665v1 [cs.LG])
    Causal inference from observational datasets often relies on measuring and adjusting for covariates. In practice, measurements of the covariates can often be noisy and/or biased, or only measurements of their proxies may be available. Directly adjusting for these imperfect measurements of the covariates can lead to biased causal estimates. Moreover, without additional assumptions, the causal effects are not point-identifiable due to the noise in these measurements. To this end, we study the partial identification of causal effects given noisy covariates, under a user-specified assumption on the noise level. The key observation is that we can formulate the identification of the average treatment effects (ATE) as a robust optimization problem. This formulation leads to an efficient robust optimization algorithm that bounds the ATE with noisy covariates. We show that this robust optimization approach can extend a wide range of causal adjustment methods to perform partial identification, including backdoor adjustment, inverse propensity score weighting, double machine learning, and front door adjustment. Across synthetic and real datasets, we find that this approach provides ATE bounds with a higher coverage probability than existing methods.
    Learning Infomax and Domain-Independent Representations for Causal Effect Inference with Real-World Data. (arXiv:2202.10885v1 [stat.ML])
    The foremost challenge to causal inference with real-world data is to handle the imbalance in the covariates with respect to different treatment options, caused by treatment selection bias. To address this issue, recent literature has explored domain-invariant representation learning based on different domain divergence metrics (e.g., Wasserstein distance, maximum mean discrepancy, position-dependent metric, and domain overlap). In this paper, we reveal the weaknesses of these strategies, i.e., they lead to the loss of predictive information when enforcing the domain invariance; and the treatment effect estimation performance is unstable, which heavily relies on the characteristics of the domain distributions and the choice of domain divergence metrics. Motivated by information theory, we propose to learn the Infomax and Domain-Independent Representations to solve the above puzzles. Our method utilizes the mutual information between the global feature representations and individual feature representations, and the mutual information between feature representations and treatment assignment predictions, in order to maximally capture the common predictive information for both treatment and control groups. Moreover, our method filters out the influence of instrumental and irrelevant variables, and thus it effectively increases the predictive ability of potential outcomes. Experimental results on both the synthetic and real-world datasets show that our method achieves state-of-the-art performance on causal effect inference. Moreover, our method exhibits reliable prediction performances when facing data with different characteristics of data distributions, complicated variable types, and severe covariate imbalance.
    Speciesist bias in AI -- How AI applications perpetuate discrimination and unfair outcomes against animals. (arXiv:2202.10848v1 [cs.LG])
    Massive efforts are made to reduce biases in both data and algorithms in order to render AI applications fair. These efforts are propelled by various high-profile cases where biased algorithmic decision-making caused harm to women, people of color, minorities, etc. However, the AI fairness field still succumbs to a blind spot, namely its insensitivity to discrimination against animals. This paper is the first to describe the 'speciesist bias' and investigate it in several different AI systems. Speciesist biases are learned and solidified by AI applications when they are trained on datasets in which speciesist patterns prevail. These patterns can be found in image recognition systems, large language models, and recommender systems. Therefore, AI technologies currently play a significant role in perpetuating and normalizing violence against animals. This can only be changed when AI fairness frameworks widen their scope and include mitigation measures for speciesist biases. This paper addresses the AI community in this regard and stresses the influence AI systems can have on either increasing or reducing the violence that is inflicted on animals, and especially on farmed animals.
    Confident Neural Network Regression with Bootstrapped Deep Ensembles. (arXiv:2202.10903v1 [stat.ML])
    With the rise of the popularity and usage of neural networks, trustworthy uncertainty estimation is becoming increasingly essential. In this paper we present a computationally cheap extension of Deep Ensembles for a regression setting called Bootstrapped Deep Ensembles that explicitly takes the effect of finite data into account using a modified version of the parametric bootstrap. We demonstrate through a simulation study that our method has comparable or better prediction intervals and superior confidence intervals compared to Deep Ensembles and other state-of-the-art methods. As an added bonus, our method is better capable of detecting overfitting than standard Deep Ensembles.
    Transporters with Visual Foresight for Solving Unseen Rearrangement Tasks. (arXiv:2202.10765v1 [cs.RO])
    Rearrangement tasks have been identified as a crucial challenge for intelligent robotic manipulation, but few methods allow for precise construction of unseen structures. We propose a visual foresight model for pick-and-place manipulation which is able to learn efficiently. In addition, we develop a multi-modal action proposal module which builds on Goal-Conditioned Transporter Networks, a state-of-the-art imitation learning method. Our method, Transporters with Visual Foresight (TVF), enables task planning from image data and is able to achieve multi-task learning and zero-shot generalization to unseen tasks with only a handful of expert demonstrations. TVF is able to improve the performance of a state-of-the-art imitation learning method on both training and unseen tasks in simulation and real robot experiments. In particular, the average success rate on unseen tasks improves from 55.0% to 77.9% in simulation experiments and from 30% to 63.3% in real robot experiments when given only tens of expert demonstrations. More details can be found on our project website: https://chirikjianlab.github.io/tvf/
    Seeing is Living? Rethinking the Security of Facial Liveness Verification in the Deepfake Era. (arXiv:2202.10673v1 [cs.CR])
    Facial Liveness Verification (FLV) is widely used for identity authentication in many security-sensitive domains and offered as Platform-as-a-Service (PaaS) by leading cloud vendors. Yet, with the rapid advances in synthetic media techniques (e.g., deepfake), the security of FLV is facing unprecedented challenges, about which little is known thus far. To bridge this gap, in this paper, we conduct the first systematic study on the security of FLV in real-world settings. Specifically, we present LiveBugger, a new deepfake-powered attack framework that enables customizable, automated security evaluation of FLV. Leveraging LiveBugger, we perform a comprehensive empirical assessment of representative FLV platforms, leading to a set of interesting findings. For instance, most FLV APIs do not use anti-deepfake detection; even for those with such defenses, their effectiveness is concerning (e.g., it may detect high-quality synthesized videos but fail to detect low-quality ones). We then conduct an in-depth analysis of the factors impacting the attack performance of LiveBugger: a) the bias (e.g., gender or race) in FLV can be exploited to select victims; b) adversarial training makes deepfake more effective to bypass FLV; c) the input quality has a varying influence on different deepfake techniques to bypass FLV. Based on these findings, we propose a customized, two-stage approach that can boost the attack success rate by up to 70%. Further, we run proof-of-concept attacks on several representative applications of FLV (i.e., the clients of FLV APIs) to illustrate the practical implications: due to the vulnerability of the APIs, many downstream applications are vulnerable to deepfake. Finally, we discuss potential countermeasures to improve the security of FLV. Our findings have been confirmed by the corresponding vendors.
    Convergence of online $k$-means. (arXiv:2202.10640v1 [cs.LG])
    We prove asymptotic convergence for a general class of $k$-means algorithms performed over streaming data from a distribution: the centers asymptotically converge to the set of stationary points of the $k$-means cost function. To do so, we show that online $k$-means over a distribution can be interpreted as stochastic gradient descent with a stochastic learning rate schedule. Then, we prove convergence by extending techniques used in optimization literature to handle settings where center-specific learning rates may depend on the past trajectory of the centers.
    Behaviour-Diverse Automatic Penetration Testing: A Curiosity-Driven Multi-Objective Deep Reinforcement Learning Approach. (arXiv:2202.10630v1 [cs.LG])
    Penetration Testing plays a critical role in evaluating the security of a target network by emulating real active adversaries. Deep Reinforcement Learning (RL) is seen as a promising solution to automating the process of penetration tests by reducing human effort and improving reliability. Existing RL solutions focus on finding a specific attack path to impact the target hosts. However, in reality, a diverse range of attack variations are needed to provide comprehensive assessments of the target network's security level. Hence, the attack agents must consider multiple objectives when penetrating the network. Nevertheless, this challenge is not adequately addressed in the existing literature. To this end, we formulate the automatic penetration testing in the Multi-Objective Reinforcement Learning (MORL) framework and propose a Chebyshev decomposition critic to find diverse adversary strategies that balance different objectives in the penetration test. Additionally, the number of available actions increases with the agent consistently probing the target network, making the training process intractable in many practical situations. Thus, we introduce a coverage-based masking mechanism that reduces attention on previously selected actions to help the agent adapt to future exploration. Experimental evaluation on a range of scenarios demonstrates the superiority of our proposed approach when compared to adapted algorithms in terms of multi-objective learning and performance efficiency.
    Decentralized Safe Multi-agent Stochastic Optimal Control using Deep FBSDEs and ADMM. (arXiv:2202.10658v1 [cs.MA])
    In this work, we propose a novel safe and scalable decentralized solution for multi-agent control in the presence of stochastic disturbances. Safety is mathematically encoded using stochastic control barrier functions and safe controls are computed by solving quadratic programs. Decentralization is achieved by augmenting to each agent's optimization variables, copy variables, for its neighbors. This allows us to decouple the centralized multi-agent optimization problem. However, to ensure safety, neighboring agents must agree on "what is safe for both of us" and this creates a need for consensus. To enable safe consensus solutions, we incorporate an ADMM-based approach. Specifically, we propose a Merged CADMM-OSQP implicit neural network layer, that solves a mini-batch of both, local quadratic programs as well as the overall consensus problem, as a single optimization problem. This layer is embedded within a Deep FBSDEs network architecture at every time step, to facilitate end-to-end differentiable, safe and decentralized stochastic optimal control. The efficacy of the proposed approach is demonstrated on several challenging multi-robot tasks in simulation. By imposing requirements on safety specified by collision avoidance constraints, the safe operation of all agents is ensured during the entire training process. We also demonstrate superior scalability in terms of computational and memory savings as compared to a centralized approach.
    On the Effectiveness of Adversarial Training against Backdoor Attacks. (arXiv:2202.10627v1 [cs.LG])
    DNNs' demand for massive data forces practitioners to collect data from the Internet without careful check due to the unacceptable cost, which brings potential risks of backdoor attacks. A backdoored model always predicts a target class in the presence of a predefined trigger pattern, which can be easily realized via poisoning a small amount of data. In general, adversarial training is believed to defend against backdoor attacks since it helps models to keep their prediction unchanged even if we perturb the input image (as long as within a feasible range). Unfortunately, few previous studies succeed in doing so. To explore whether adversarial training could defend against backdoor attacks or not, we conduct extensive experiments across different threat models and perturbation budgets, and find the threat model in adversarial training matters. For instance, adversarial training with spatial adversarial examples provides notable robustness against commonly-used patch-based backdoor attacks. We further propose a hybrid strategy which provides satisfactory robustness across different backdoor attacks.
    T-METASET: Task-Aware Generation of Metamaterial Datasets by Diversity-Based Active Learning. (arXiv:2202.10565v1 [cs.CE])
    Inspired by the recent success of deep learning in diverse domains, data-driven metamaterials design has emerged as a compelling design paradigm to unlock the potential of multiscale architecture. However, existing model-centric approaches lack principled methodologies dedicated to high-quality data generation. Resorting to space-filling design in shape descriptor space, existing metamaterial datasets suffer from property distributions that are either highly imbalanced or at odds with design tasks of interest. To this end, we propose t-METASET: an intelligent data acquisition framework for task-aware dataset generation. We seek a solution to a commonplace yet frequently overlooked scenario at early design stages: when a massive ($~\sim O(10^4)$) shape library has been prepared with no properties evaluated. The key idea is to exploit a data-driven shape descriptor learned from generative models, fit a sparse regressor as the start-up agent, and leverage diversity-related metrics to drive data acquisition to areas that help designers fulfill design goals. We validate the proposed framework in three hypothetical deployment scenarios, which encompass general use, task-aware use, and tailorable use. Two large-scale shape-only mechanical metamaterial datasets are used as test datasets. The results demonstrate that t-METASET can incrementally grow task-aware datasets. Applicable to general design representations, t-METASET can boost future advancements of not only metamaterials but data-driven design in other domains.
    Data-Driven Traffic Assignment: A Novel Approach for Learning Traffic Flow Patterns Using a Graph Convolutional Neural Network. (arXiv:2202.10508v1 [cs.LG])
    We present a novel data-driven approach of learning traffic flow patterns of a transportation network given that many instances of origin to destination (OD) travel demand and link flows of the network are available. Instead of estimating traffic flow patterns assuming certain user behavior (e.g., user equilibrium or system optimal), here we explore the idea of learning those flow patterns directly from the data. To implement this idea, we have formulated the traffic-assignment problem as a data-driven learning problem and developed a neural network-based framework known as Graph Convolutional Neural Network (GCNN) to solve it. The proposed framework represents the transportation network and OD demand in an efficient way and utilizes the diffusion process of multiple OD demands from nodes to links. We validate the solutions of the model against analytical solutions generated from running static user equilibrium-based traffic assignments over Sioux Falls and East Massachusetts networks. The validation result shows that the implemented GCNN model can learn the flow patterns very well with less than 2% mean absolute difference between the actual and estimated link flows for both networks under varying congested conditions. When the training of the model is complete, it can instantly determine the traffic flows of a large-scale network. Hence this approach can overcome the challenges of deploying traffic assignment models over large-scale networks and open new directions of research in data-driven network modeling.
    Knowledge-informed Molecular Learning: A Survey on Paradigm Transfer. (arXiv:2202.10587v1 [cs.LG])
    Machine learning, especially deep learning, has greatly advanced molecular studies in the biochemical domain. Most typically, modeling for most molecular tasks have converged to several paradigms. For example, we usually adopt the prediction paradigm to solve tasks of molecular property prediction. To improve the generation and interpretability of purely data-driven models, researchers have incorporated biochemical domain knowledge into these models for molecular studies. This knowledge incorporation has led to a rising trend of paradigm transfer, which is solving one molecular learning task by reformulating it as another one. In this paper, we present a literature review towards knowledge-informed molecular learning in perspective of paradigm transfer, where we categorize the paradigms, review their methods and analyze how domain knowledge contributes. Furthermore, we summarize the trends and point out interesting future directions for molecular learning.
    Adversarial Attacks on Speech Recognition Systems for Mission-Critical Applications: A Survey. (arXiv:2202.10594v1 [cs.SD])
    A Machine-Critical Application is a system that is fundamentally necessary to the success of specific and sensitive operations such as search and recovery, rescue, military, and emergency management actions. Recent advances in Machine Learning, Natural Language Processing, voice recognition, and speech processing technologies have naturally allowed the development and deployment of speech-based conversational interfaces to interact with various machine-critical applications. While these conversational interfaces have allowed users to give voice commands to carry out strategic and critical activities, their robustness to adversarial attacks remains uncertain and unclear. Indeed, Adversarial Artificial Intelligence (AI) which refers to a set of techniques that attempt to fool machine learning models with deceptive data, is a growing threat in the AI and machine learning research community, in particular for machine-critical applications. The most common reason of adversarial attacks is to cause a malfunction in a machine learning model. An adversarial attack might entail presenting a model with inaccurate or fabricated samples as it's training data, or introducing maliciously designed data to deceive an already trained model. While focusing on speech recognition for machine-critical applications, in this paper, we first review existing speech recognition techniques, then, we investigate the effectiveness of adversarial attacks and defenses against these systems, before outlining research challenges, defense recommendations, and future work. This paper is expected to serve researchers and practitioners as a reference to help them in understanding the challenges, position themselves and, ultimately, help them to improve existing models of speech recognition for mission-critical applications. Keywords: Mission-Critical Applications, Adversarial AI, Speech Recognition Systems.
    Benchmarking missing-values approaches for predictive models on health databases. (arXiv:2202.10580v1 [cs.LG])
    BACKGROUND: As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative -- rather than generative -- modeling, and thus open the door to new missing-values strategies. Yet existing empirical evaluations of strategies to handle missing values have focused on inferential statistics. RESULTS: Here we conduct a systematic benchmark of missing-values strategies in predictive models with a focus on large health databases: four electronic health record datasets, a population brain imaging one, a health survey and two intensive care ones. Using gradient-boosted trees, we compare native support for missing values with simple and state-of-the-art imputation prior to learning. We investigate prediction accuracy and computational time. For prediction after imputation, we find that adding an indicator to express which values have been imputed is important, suggesting that the data are missing not at random. Elaborate missing values imputation can improve prediction compared to simple strategies but requires longer computational time on large data. Learning trees that model missing values-with missing incorporated attribute-leads to robust, fast, and well-performing predictive modeling. CONCLUSIONS: Native support for missing values in supervised machine learning predicts better than state-of-the-art imputation with much less computational cost. When using imputation, it is important to add indicator columns expressing which values have been imputed.
    Graph Lifelong Learning: A Survey. (arXiv:2202.10688v1 [cs.LG])
    Graph learning substantially contributes to solving artificial intelligence (AI) tasks in various graph-related domains such as social networks, biological networks, recommender systems, and computer vision. However, despite its unprecedented prevalence, addressing the dynamic evolution of graph data over time remains a challenge. In many real-world applications, graph data continuously evolves. Current graph learning methods that assume graph representation is complete before the training process begins are not applicable in this setting. This challenge in graph learning motivates the development of a continuous learning process called graph lifelong learning to accommodate the future and refine the previous knowledge in graph data. Unlike existing survey papers that focus on either lifelong learning or graph learning separately, this survey paper covers the motivations, potentials, state-of-the-art approaches (that are well categorized), and open issues of graph lifelong learning. We expect extensive research and development interest in this emerging field.
    Targeting occupant feedback using digital twins: Adaptive spatial-temporal thermal preference sampling to optimize personal comfort models. (arXiv:2202.10707v1 [cs.LG])
    Collecting intensive longitudinal thermal preference data from building occupants is emerging as an innovative means of characterizing the performance of buildings and the people who use them. These techniques have occupants giving subjective feedback using smartphones or smartwatches frequently over the course of days or weeks. The intention is that the data will be collected with high spatial and temporal diversity to best characterize a building and the occupant's preferences. But in reality, leaving the occupant to respond in an ad-hoc or fixed interval way creates unneeded survey fatigue and redundant data. This paper outlines a scenario-based (virtual experiment) method for optimizing data sampling using a smartwatch to achieve comparable accuracy in a personal thermal preference model with less data. This method uses BIM-extracted spatial data, and Graph Neural Network (GNN) based modeling to find regions of similar comfort preference to identify the best scenarios for triggering the occupant to give feedback. This method is compared to two baseline scenarios based on the spatial context of specific spaces and 4 x 4 m grid squares in the building using a theoretical implementation on two field-collected data sets. The results show that the proposed Build2Vec method is 18-23% more in the overall sampling quality than the spaces-based and the square-grid-based sampling methods. The Build2Vec method also has similar performance to the baselines when removing redundant occupant feedback points but with better scalability potential.
    Online Caching with Optimistic Learning. (arXiv:2202.10590v1 [cs.NI])
    The design of effective online caching policies is an increasingly important problem for content distribution networks, online social networks and edge computing services, among other areas. This paper proposes a new algorithmic toolbox for tackling this problem through the lens of optimistic online learning. We build upon the Follow-the-Regularized-Leader (FTRL) framework which is developed further here to include predictions for the file requests, and we design online caching algorithms for bipartite networks with fixed-size caches or elastic leased caches subject to time-average budget constraints. The predictions are provided by a content recommendation system that influences the users viewing activity, and hence can naturally reduce the caching network's uncertainty about future requests. We prove that the proposed optimistic learning caching policies can achieve sub-zero performance loss (regret) for perfect predictions, and maintain the best achievable regret bound $O(\sqrt T)$ even for arbitrary-bad predictions. The performance of the proposed algorithms is evaluated with detailed trace-driven numerical tests.
    CROMOSim: A Deep Learning-based Cross-modality Inertial Measurement Simulator. (arXiv:2202.10562v1 [eess.SP])
    With the prevalence of wearable devices, inertial measurement unit (IMU) data has been utilized in monitoring and assessment of human mobility such as human activity recognition (HAR). Training deep neural network (DNN) models for these tasks require a large amount of labeled data, which are hard to acquire in uncontrolled environments. To mitigate the data scarcity problem, we design CROMOSim, a cross-modality sensor simulator that simulates high fidelity virtual IMU sensor data from motion capture systems or monocular RGB cameras. It utilizes a skinned multi-person linear model (SMPL) for 3D body pose and shape representations, to enable simulation from arbitrary on-body positions. A DNN model is trained to learn the functional mapping from imperfect trajectory estimations in a 3D SMPL body tri-mesh due to measurement noise, calibration errors, occlusion and other modeling artifacts, to IMU data. We evaluate the fidelity of CROMOSim simulated data and its utility in data augmentation on various HAR datasets. Extensive experiment results show that the proposed model achieves a 6.7% improvement over baseline methods in a HAR task.
    Transition Matrix Representation of Trees with Transposed Convolutions. (arXiv:2202.10677v1 [cs.LG])
    How can we effectively find the best structures in tree models? Tree models have been favored over complex black box models in domains where interpretability is crucial for making irreversible decisions. However, searching for a tree structure that gives the best balance between the performance and the interpretability remains a challenging task. In this paper, we propose TART (Transition Matrix Representation with Transposed Convolutions), our novel generalized tree representation for optimal structural search. TART represents a tree model with a series of transposed convolutions that boost the speed of inference by avoiding the creation of transition matrices. As a result, TART allows one to search for the best tree structure with a few design parameters, achieving higher classification accuracy than those of baseline models in feature-based datasets.
    Connecting Optimization and Generalization via Gradient Flow Path Length. (arXiv:2202.10670v1 [stat.ML])
    Optimization and generalization are two essential aspects of machine learning. In this paper, we propose a framework to connect optimization with generalization by analyzing the generalization error based on the length of optimization trajectory under the gradient flow algorithm after convergence. Through our approach, we show that, with a proper initialization, gradient flow converges following a short path with an explicit length estimate. Such an estimate induces a length-based generalization bound, showing that short optimization paths after convergence are associated with good generalization, which also matches our numerical results. Our framework can be applied to broad settings. For example, we use it to obtain generalization estimates on three distinct machine learning models: underdetermined $\ell_p$ linear regression, kernel regression, and overparameterized two-layer ReLU neural networks.
    Effective Training Strategies for Deep-learning-based Precipitation Nowcasting and Estimation. (arXiv:2202.10555v1 [cs.CV])
    Deep learning has been successfully applied to precipitation nowcasting. In this work, we propose a pre-training scheme and a new loss function for improving deep-learning-based nowcasting. First, we adapt U-Net, a widely-used deep-learning model, for the two problems of interest here: precipitation nowcasting and precipitation estimation from radar images. We formulate the former as a classification problem with three precipitation intervals and the latter as a regression problem. For these tasks, we propose to pre-train the model to predict radar images in the near future without requiring ground-truth precipitation, and we also propose the use of a new loss function for fine-tuning to mitigate the class imbalance problem. We demonstrate the effectiveness of our approach using radar images and precipitation datasets collected from South Korea over seven years. It is highlighted that our pre-training scheme and new loss function improve the critical success index (CSI) of nowcasting of heavy rainfall (at least 10 mm/hr) by up to 95.7% and 43.6%, respectively, at a 5-hr lead time. We also demonstrate that our approach reduces the precipitation estimation error by up to 10.7%, compared to the conventional approach, for light rainfall (between 1 and 10 mm/hr). Lastly, we report the sensitivity of our approach to different resolutions and a detailed analysis of four cases of heavy rainfall.
    Gaussian Processes and Statistical Decision-making in Non-Euclidean Spaces. (arXiv:2202.10613v1 [stat.ML])
    Bayesian learning using Gaussian processes provides a foundational framework for making decisions in a manner that balances what is known with what could be learned by gathering data. In this dissertation, we develop techniques for broadening the applicability of Gaussian processes. This is done in two ways. Firstly, we develop pathwise conditioning techniques for Gaussian processes, which allow one to express posterior random functions as prior random functions plus a dependent update term. We introduce a wide class of efficient approximations built from this viewpoint, which can be randomly sampled once in advance, and evaluated at arbitrary locations without any subsequent stochasticity. This key property improves efficiency and makes it simpler to deploy Gaussian process models in decision-making settings. Secondly, we develop a collection of Gaussian process models over non-Euclidean spaces, including Riemannian manifolds and graphs. We derive fully constructive expressions for the covariance kernels of scalar-valued Gaussian processes on Riemannian manifolds and graphs. Building on these ideas, we describe a formalism for defining vector-valued Gaussian processes on Riemannian manifolds. The introduced techniques allow all of these models to be trained using standard computational methods. In total, these contributions make Gaussian processes easier to work with and allow them to be used within a wider class of domains in an effective and principled manner. This, in turn, makes it possible to potentially apply Gaussian processes to novel decision-making settings.
    MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned. (arXiv:2202.10583v1 [cs.LG])
    Reinforcement learning competitions advance the field by providing appropriate scope and support to develop solutions toward a specific problem. To promote the development of more broadly applicable methods, organizers need to enforce the use of general techniques, the use of sample-efficient methods, and the reproducibility of the results. While beneficial for the research community, these restrictions come at a cost -- increased difficulty. If the barrier for entry is too high, many potential participants are demoralized. With this in mind, we hosted the third edition of the MineRL ObtainDiamond competition, MineRL Diamond 2021, with a separate track in which we permitted any solution to promote the participation of newcomers. With this track and more extensive tutorials and support, we saw an increased number of submissions. The participants of this easier track were able to obtain a diamond, and the participants of the harder track progressed the generalizable solutions in the same task.
    Myriad: a real-world testbed to bridge trajectory optimization and deep learning. (arXiv:2202.10600v1 [cs.LG])
    We present Myriad, a testbed written in JAX for learning and planning in real-world continuous environments. The primary contributions of Myriad are threefold. First, Myriad provides machine learning practitioners access to trajectory optimization techniques for application within a typical automatic differentiation workflow. Second, Myriad presents many real-world optimal control problems, ranging from biology to medicine to engineering, for use by the machine learning community. Formulated in continuous space and time, these environments retain some of the complexity of real-world systems often abstracted away by standard benchmarks. As such, Myriad strives to serve as a stepping stone towards application of modern machine learning techniques for impactful real-world tasks. Finally, we use the Myriad repository to showcase a novel approach for learning and control tasks. Trained in a fully end-to-end fashion, our model leverages an implicit planning module over neural ordinary differential equations, enabling simultaneous learning and planning with complex environment dynamics.
    Knowledge Base Question Answering by Case-based Reasoning over Subgraphs. (arXiv:2202.10610v1 [cs.CL])
    Question answering (QA) over real-world knowledge bases (KBs) is challenging because of the diverse (essentially unbounded) types of reasoning patterns needed. However, we hypothesize in a large KB, reasoning patterns required to answer a query type reoccur for various entities in their respective subgraph neighborhoods. Leveraging this structural similarity between local neighborhoods of different subgraphs, we introduce a semiparametric model with (i) a nonparametric component that for each query, dynamically retrieves other similar $k$-nearest neighbor (KNN) training queries along with query-specific subgraphs and (ii) a parametric component that is trained to identify the (latent) reasoning patterns from the subgraphs of KNN queries and then apply it to the subgraph of the target query. We also propose a novel algorithm to select a query-specific compact subgraph from within the massive knowledge graph (KG), allowing us to scale to full Freebase KG containing billions of edges. We show that our model answers queries requiring complex reasoning patterns more effectively than existing KG completion algorithms. The proposed model outperforms or performs competitively with state-of-the-art models on several KBQA benchmarks.
    Strategic Ranking. (arXiv:2109.08240v2 [cs.GT] UPDATED)
    Strategic classification studies the design of a classifier robust to the manipulation of input by strategic individuals. However, the existing literature does not consider the effect of competition among individuals as induced by the algorithm design. Motivated by constrained allocation settings such as college admissions, we introduce strategic ranking, in which the (designed) individual reward depends on an applicant's post-effort rank in a measurement of interest. Our results illustrate how competition among applicants affects the resulting equilibria and model insights. We analyze how various ranking reward designs, belonging to a family of step functions, trade off applicant, school, and societal utility, as well as how ranking design counters inequities arising from disparate access to resources. In particular, we find that randomization in the reward design can mitigate two measures of disparate impact, welfare gap and access.
    Modality Fusion Network and Personalized Attention in Momentary Stress Detection in the Wild. (arXiv:2107.09510v3 [eess.SP] UPDATED)
    Multimodal wearable physiological data in daily life have been used to estimate self-reported stress labels. However, missing data modalities in data collection makes it challenging to leverage all the collected samples. Besides, heterogeneous sensor data and labels among individuals add challenges in building robust stress detection models. In this paper, we proposed a modality fusion network (MFN) to train models and infer self-reported binary stress labels under both complete and incomplete modality conditions. In addition, we applied personalized attention (PA) strategy to leverage personalized representation along with the generalized one-size-fits-all model. We evaluated our methods on a multimodal wearable sensor dataset (N=41) including galvanic skin response (GSR) and electrocardiogram (ECG). Compared to the baseline method using the samples with complete modalities, the performance of the MFN improved by 1.6% in f1-scores. On the other hand, the proposed PA strategy showed a 2.3% higher stress detection f1-score and approximately up to 70% reduction in personalized model parameter size (9.1 MB) compared to the previous state-of-the-art transfer learning strategy (29.3 MB).
    Nearest neighbor search with compact codes: A decoder perspective. (arXiv:2112.09568v2 [cs.CV] UPDATED)
    Modern approaches for fast retrieval of similar vectors on billion-scaled datasets rely on compressed-domain approaches such as binary sketches or product quantization. These methods minimize a certain loss, typically the mean squared error or other objective functions tailored to the retrieval problem. In this paper, we re-interpret popular methods such as binary hashing or product quantizers as auto-encoders, and point out that they implicitly make suboptimal assumptions on the form of the decoder. We design backward-compatible decoders that improve the reconstruction of the vectors from the same codes, which translates to a better performance in nearest neighbor search. Our method significantly improves over binary hashing methods or product quantization on popular benchmarks.
    Click-Through Rate Prediction in Online Advertising: A Literature Review. (arXiv:2202.10462v1 [cs.IR])
    Predicting the probability that a user will click on a specific advertisement has been a prevalent issue in online advertising, attracting much research attention in the past decades. As a hot research frontier driven by industrial needs, recent years have witnessed more and more novel learning models employed to improve advertising CTR prediction. Although extant research provides necessary details on algorithmic design for addressing a variety of specific problems in advertising CTR prediction, the methodological evolution and connections between modeling frameworks are precluded. However, to the best of our knowledge, there are few comprehensive surveys on this topic. We make a systematic literature review on state-of-the-art and latest CTR prediction research, with a special focus on modeling frameworks. Specifically, we give a classification of state-of-the-art CTR prediction models in the extant literature, within which basic modeling frameworks and their extensions, advantages and disadvantages, and performance assessment for CTR prediction are presented. Moreover, we summarize CTR prediction models with respect to the complexity and the order of feature interactions, and performance comparisons on various datasets. Furthermore, we identify current research trends, main challenges and potential future directions worthy of further explorations. This review is expected to provide fundamental knowledge and efficient entry points for IS and marketing scholars who want to engage in this area.
    Batched Dueling Bandits. (arXiv:2202.10660v1 [cs.LG])
    The $K$-armed dueling bandit problem, where the feedback is in the form of noisy pairwise comparisons, has been widely studied. Previous works have only focused on the sequential setting where the policy adapts after every comparison. However, in many applications such as search ranking and recommendation systems, it is preferable to perform comparisons in a limited number of parallel batches. We study the batched $K$-armed dueling bandit problem under two standard settings: (i) existence of a Condorcet winner, and (ii) strong stochastic transitivity and stochastic triangle inequality. For both settings, we obtain algorithms with a smooth trade-off between the number of batches and regret. Our regret bounds match the best known sequential regret bounds (up to poly-logarithmic factors), using only a logarithmic number of batches. We complement our regret analysis with a nearly-matching lower bound. Finally, we also validate our theoretical results via experiments on synthetic and real data.
    Non-Volatile Memory Accelerated Posterior Estimation. (arXiv:2202.10522v1 [cs.LG])
    Bayesian inference allows machine learning models to express uncertainty. Current machine learning models use only a single learnable parameter combination when making predictions, and as a result are highly overconfident when their predictions are wrong. To use more learnable parameter combinations efficiently, these samples must be drawn from the posterior distribution. Unfortunately computing the posterior directly is infeasible, so often researchers approximate it with a well known distribution such as a Gaussian. In this paper, we show that through the use of high-capacity persistent storage, models whose posterior distribution was too big to approximate are now feasible, leading to improved predictions in downstream tasks.
    FLEA: Provably Fair Multisource Learning from Unreliable Training Data. (arXiv:2106.11732v3 [cs.LG] UPDATED)
    Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that allows the learning system to identify and suppress those data sources that would have a negative impact on fairness or accuracy if they were used for training. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally, we prove formally that - given enough data - FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half.
    Simultaneous Transport Evolution for Minimax Equilibria on Measures. (arXiv:2202.06460v2 [cs.LG] UPDATED)
    Min-max optimization problems arise in several key machine learning setups, including adversarial learning and generative modeling. In their general form, in absence of convexity/concavity assumptions, finding pure equilibria of the underlying two-player zero-sum game is computationally hard [Daskalakis et al., 2021]. In this work we focus instead in finding mixed equilibria, and consider the associated lifted problem in the space of probability measures. By adding entropic regularization, our main result establishes global convergence towards the global equilibrium by using simultaneous gradient ascent-descent with respect to the Wasserstein metric -- a dynamics that admits efficient particle discretization in high-dimensions, as opposed to entropic mirror descent. We complement this positive result with a related entropy-regularized loss which is not bilinear but still convex-concave in the Wasserstein geometry, and for which simultaneous dynamics do not converge yet timescale separation does. Taken together, these results showcase the benign geometry of bilinear games in the space of measures, enabling particle dynamics with global qualitative convergence guarantees.
    Reconstruction on Trees and Low-Degree Polynomials. (arXiv:2109.06915v2 [math.PR] UPDATED)
    The study of Markov processes and broadcasting on trees has deep connections to a variety of areas including statistical physics, graphical models, phylogenetic reconstruction, MCMC algorithms, and community detection in random graphs. Notably, the celebrated Belief Propagation (BP) algorithm achieves optimal performance for the reconstruction problem of predicting the value of the Markov process at the root of the tree from its values at the leaves. Recently, the analysis of low-degree polynomials has emerged as a valuable tool for predicting computational-to-statistical gaps. In this work, we investigate the performance of low-degree polynomials for the reconstruction problem. Perhaps surprisingly, we show that there are simple tree models of fixed arity $d$ and growing depth $\ell$ (so $N = 2^{\ell \log_2(d)}$ leaves) where (1) nontrivial reconstruction of the root value is possible with a simple polynomial time algorithm and with robustness to noise, but not with any polynomial of degree $2^{c \ell} = N^{c/\log_2(d)}$ for $c > 0$ a constant, and (2) when the tree is unknown and given multiple samples with correlated root assignments, nontrivial reconstruction of the root value is possible with a simple, noise-robust, and computationally efficient SQ algorithm but not with any polynomial of degree $2^{c \ell}$. These results clarify limitations of low-degree polynomials vs. polynomial time algorithms for Bayesian estimation problems. They also complement recent work of Moitra, Mossel, and Sandon who studied the circuit complexity of Belief Propagation. As a consequence of our main result, we show that for some $c' > 0$, $\exp(2^{c'\ell}) = \exp(N^{c'/\log_2(d)})$ many samples are needed for RBF kernel regression to obtain nontrivial correlation with the true regression function (BP). We pose related open questions about low-degree polynomials and the Kesten-Stigum threshold.
    Correlation Robust Influence Maximization. (arXiv:2010.14620v2 [cs.SI] UPDATED)
    We propose a distributionally robust model for the influence maximization problem. Unlike the classic independent cascade model \citep{kempe2003maximizing}, this model's diffusion process is adversarially adapted to the choice of seed set. Hence, instead of optimizing under the assumption that all influence relationships in the network are independent, we seek a seed set whose expected influence under the worst correlation, i.e. the "worst-case, expected influence", is maximized. We show that this worst-case influence can be efficiently computed, and though the optimization is NP-hard, a ($1 - 1/e$) approximation guarantee holds. We also analyze the structure to the adversary's choice of diffusion process, and contrast with established models. Beyond the key computational advantages, we also highlight the extent to which the independence assumption may cost optimality, and provide insights from numerical experiments comparing the adversarial and independent cascade model.
    Enhancing operations management through smart sensors: measuring and improving well-being, interaction and performance of logistics workers. (arXiv:2112.08213v2 [physics.soc-ph] UPDATED)
    Purpose The purpose of the research is to conduct an exploratory investigation of the material handling activities of an Italian logistics hub. Wearable sensors and other smart tools were used for collecting human and environmental features during working activities. These factors were correlated with workers' performance and well-being. Design/methodology/approach Human and environmental factors play an important role in operations management activities since they significantly influence employees' performance, well-being and safety. Surprisingly, empirical studies about the impact of such aspects on logistics operations are still very limited. Trying to fill this gap, the research empirically explores human and environmental factors affecting the performance of logistics workers exploiting smart tools. Findings Results suggest that human attitudes, interactions, emotions and environmental conditions remarkably influence workers' performance and well-being, however, showing different relationships depending on individual characteristics of each worker. Practical implications The authors' research opens up new avenues for profiling employees and adopting an individualized human resource management, providing managers with an operational system capable to potentially check and improve workers' well-being and performance. Originality/value The originality of the study comes from the in-depth exploration of human and environmental factors using body-worn sensors during work activities, by recording individual, collaborative and environmental data in real-time. To the best of the authors' knowledge, the current paper is the first time that such a detailed analysis has been carried out in real-world logistics operations.
    Online Learning for Orchestration of Inference in Multi-User End-Edge-Cloud Networks. (arXiv:2202.10541v1 [cs.LG])
    Deep-learning-based intelligent services have become prevalent in cyber-physical applications including smart cities and health-care. Deploying deep-learning-based intelligence near the end-user enhances privacy protection, responsiveness, and reliability. Resource-constrained end-devices must be carefully managed in order to meet the latency and energy requirements of computationally-intensive deep learning services. Collaborative end-edge-cloud computing for deep learning provides a range of performance and efficiency that can address application requirements through computation offloading. The decision to offload computation is a communication-computation co-optimization problem that varies with both system parameters (e.g., network condition) and workload characteristics (e.g., inputs). On the other hand, deep learning model optimization provides another source of tradeoff between latency and model accuracy. An end-to-end decision-making solution that considers such computation-communication problem is required to synergistically find the optimal offloading policy and model for deep learning services. To this end, we propose a reinforcement-learning-based computation offloading solution that learns optimal offloading policy considering deep learning model selection techniques to minimize response time while providing sufficient accuracy. We demonstrate the effectiveness of our solution for edge devices in an end-edge-cloud system and evaluate with a real-setup implementation using multiple AWS and ARM core configurations. Our solution provides 35% speedup in the average response time compared to the state-of-the-art with less than 0.9% accuracy reduction, demonstrating the promise of our online learning framework for orchestrating DL inference in end-edge-cloud systems.
    Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process. (arXiv:2202.10589v1 [stat.ML])
    This paper is concerned with constructing a confidence interval for a target policy's value offline based on a pre-collected observational data in infinite horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results, simulated and real datasets obtained from ridesharing companies.
    Contrastive-mixup learning for improved speaker verification. (arXiv:2202.10672v1 [eess.AS])
    This paper proposes a novel formulation of prototypical loss with mixup for speaker verification. Mixup is a simple yet efficient data augmentation technique that fabricates a weighted combination of random data point and label pairs for deep neural network training. Mixup has attracted increasing attention due to its ability to improve robustness and generalization of deep neural networks. Although mixup has shown success in diverse domains, most applications have centered around closed-set classification tasks. In this work, we propose contrastive-mixup, a novel augmentation strategy that learns distinguishing representations based on a distance metric. During training, mixup operations generate convex interpolations of both inputs and virtual labels. Moreover, we have reformulated the prototypical loss function such that mixup is enabled on metric learning objectives. To demonstrate its generalization given limited training data, we conduct experiments by varying the number of available utterances from each speaker in the VoxCeleb database. Experimental results show that applying contrastive-mixup outperforms the existing baseline, reducing error rate by 16% relatively, especially when the number of training utterances per speaker is limited.
    Semi-Implicit Hybrid Gradient Methods with Application to Adversarial Robustness. (arXiv:2202.10523v1 [cs.LG])
    Adversarial examples, crafted by adding imperceptible perturbations to natural inputs, can easily fool deep neural networks (DNNs). One of the most successful methods for training adversarially robust DNNs is solving a nonconvex-nonconcave minimax problem with an adversarial training (AT) algorithm. However, among the many AT algorithms, only Dynamic AT (DAT) and You Only Propagate Once (YOPO) guarantee convergence to a stationary point. In this work, we generalize the stochastic primal-dual hybrid gradient algorithm to develop semi-implicit hybrid gradient methods (SI-HGs) for finding stationary points of nonconvex-nonconcave minimax problems. SI-HGs have the convergence rate $O(1/K)$, which improves upon the rate $O(1/K^{1/2})$ of DAT and YOPO. We devise a practical variant of SI-HGs, and show that it outperforms other AT algorithms in terms of convergence speed and robustness.
    It Takes Four to Tango: Multiagent Selfplay for Automatic Curriculum Generation. (arXiv:2202.10608v1 [cs.LG])
    We are interested in training general-purpose reinforcement learning agents that can solve a wide variety of goals. Training such agents efficiently requires automatic generation of a goal curriculum. This is challenging as it requires (a) exploring goals of increasing difficulty, while ensuring that the agent (b) is exposed to a diverse set of goals in a sample efficient manner and (c) does not catastrophically forget previously solved goals. We propose Curriculum Self Play (CuSP), an automated goal generation framework that seeks to satisfy these desiderata by virtue of a multi-player game with four agents. We extend the asymmetric curricula learning in PAIRED (Dennis et al., 2020) to a symmetrized game that carefully balances cooperation and competition between two off-policy student learners and two regret-maximizing teachers. CuSP additionally introduces entropic goal coverage and accounts for the non-stationary nature of the students, allowing us to automatically induce a curriculum that balances progressive exploration with anti-catastrophic exploitation. We demonstrate that our method succeeds at generating an effective curricula of goals for a range of control tasks, outperforming other methods at zero-shot test-time generalization to novel out-of-distribution goals.
    Feasibility Study of Multi-Site Split Learning for Privacy-Preserving Medical Systems under Data Imbalance Constraints in COVID-19, X-Ray, and Cholesterol Dataset. (arXiv:2202.10456v1 [cs.LG])
    It seems as though progressively more people are in the race to upload content, data, and information online; and hospitals haven't neglected this trend either. Hospitals are now at the forefront for multi-site medical data sharing to provide groundbreaking advancements in the way health records are shared and patients are diagnosed. Sharing of medical data is essential in modern medical research. Yet, as with all data sharing technology, the challenge is to balance improved treatment with protecting patient's personal information. This paper provides a novel split learning algorithm coined the term, "multi-site split learning", which enables a secure transfer of medical data between multiple hospitals without fear of exposing personal data contained in patient records. It also explores the effects of varying the number of end-systems and the ratio of data-imbalance on the deep learning performance. A guideline for the most optimal configuration of split learning that ensures privacy of patient data whilst achieving performance is empirically given. We argue the benefits of our multi-site split learning algorithm, especially regarding the privacy preserving factor, using CT scans of COVID-19 patients, X-ray bone scans, and cholesterol level medical data.
    Personalized PATE: Differential Privacy for Machine Learning with Individual Privacy Guarantees. (arXiv:2202.10517v1 [cs.LG])
    Applying machine learning (ML) to sensitive domains requires privacy protection of the underlying training data through formal privacy frameworks, such as differential privacy (DP). Yet, usually, the privacy of the training data comes at the costs of the resulting ML models' utility. One reason for this is that DP uses one homogeneous privacy budget epsilon for all training data points, which has to align with the strictest privacy requirement encountered among all data holders. In practice, different data holders might have different privacy requirements and data points of data holders with lower requirements could potentially contribute more information to the training process of the ML models. To account for this possibility, we propose three novel methods that extend the DP framework Private Aggregation of Teacher Ensembles (PATE) to support training an ML model with different personalized privacy guarantees within the training data. We formally describe the methods, provide theoretical analyses of their privacy bounds, and experimentally evaluate their effect on the final model's utility at the example of the MNIST and Adult income datasets. Our experiments show that our personalized privacy methods yield higher accuracy models than the non-personalized baseline. Thereby, our methods can improve the privacy-utility trade-off in scenarios in which different data holders consent to contribute their sensitive data at different privacy levels.
    Physics-Informed Graph Learning: A Survey. (arXiv:2202.10679v1 [cs.LG])
    An expeditious development of graph learning in recent years has found innumerable applications in several diversified fields. Of the main associated challenges are the volume and complexity of graph data. A lot of research has been evolving around the preservation of graph data in a low dimensional space. The graph learning models suffer from the inability to maintain original graph information. In order to compensate for this inability, physics-informed graph learning (PIGL) is emerging. PIGL incorporates physics rules while performing graph learning, which enables numerous potentials. This paper presents a systematic review of PIGL methods. We begin with introducing a unified framework of graph learning models, and then examine existing PIGL methods in relation to the unified framework. We also discuss several future challenges for PIGL. This survey paper is expected to stimulate innovative research and development activities pertaining to PIGL.
    Extragradient Method: $O(1/K)$ Last-Iterate Convergence for Monotone Variational Inequalities and Connections With Cocoercivity. (arXiv:2110.04261v2 [math.OC] UPDATED)
    Extragradient method (EG) (Korpelevich, 1976) is one of the most popular methods for solving saddle point and variational inequalities problems (VIP). Despite its long history and significant attention in the optimization community, there remain important open questions about convergence of EG. In this paper, we resolve one of such questions and derive the first last-iterate $O(1/K)$ convergence rate for EG for monotone and Lipschitz VIP without any additional assumptions on the operator unlike the only known result of this type (Golowich et al., 2020) that relies on the Lipschitzness of the Jacobian of the operator. The rate is given in terms of reducing the squared norm of the operator. Moreover, we establish several results on the (non-)cocoercivity of the update operators of EG, Optimistic Gradient Method, and Hamiltonian Gradient Method, when the original operator is monotone and Lipschitz.
    Differential Secrecy for Distributed Data and Applications to Robust Differentially Secure Vector Summation. (arXiv:2202.10618v1 [cs.CR])
    Computing the noisy sum of real-valued vectors is an important primitive in differentially private learning and statistics. In private federated learning applications, these vectors are held by client devices, leading to a distributed summation problem. Standard Secure Multiparty Computation (SMC) protocols for this problem are susceptible to poisoning attacks, where a client may have a large influence on the sum, without being detected. In this work, we propose a poisoning-robust private summation protocol in the multiple-server setting, recently studied in PRIO. We present a protocol for vector summation that verifies that the Euclidean norm of each contribution is approximately bounded. We show that by relaxing the security constraint in SMC to a differential privacy like guarantee, one can improve over PRIO in terms of communication requirements as well as the client-side computation. Unlike SMC algorithms that inevitably cast integers to elements of a large finite field, our algorithms work over integers/reals, which may allow for additional efficiencies.
    Guidelines and evaluation for clinical explainable AI on medical image analysis. (arXiv:2202.10553v1 [cs.LG])
    Explainable artificial intelligence (XAI) is essential for enabling clinical users to get informed decision support from AI and comply with evidence-based medical practice. Applying XAI in clinical settings requires proper evaluation criteria to ensure the explanation technique is both technically sound and clinically useful, but specific support is lacking to achieve this goal. To bridge the research gap, we propose the Clinical XAI Guidelines that consist of five criteria a clinical XAI needs to be optimized for. The guidelines recommend choosing an explanation form based on Guideline 1 (G1) Understandability and G2 Clinical relevance. For the chosen explanation form, its specific XAI technique should be optimized for G3 Truthfulness, G4 Informative plausibility, and G5 Computational efficiency. Following the guidelines, we conducted a systematic evaluation on a novel problem of multi-modal medical image explanation with two clinical tasks, and proposed new evaluation metrics accordingly. The evaluated 16 commonly-used heatmap XAI techniques were not suitable for clinical use due to their failure in \textbf{G3} and \textbf{G4}. Our evaluation demonstrated the use of Clinical XAI Guidelines to support the design and evaluation for clinically viable XAI.
    Order-Optimal Error Bounds for Noisy Kernel-Based Bayesian Quadrature. (arXiv:2202.10615v1 [stat.ML])
    In this paper, we study the sample complexity of {\em noisy Bayesian quadrature} (BQ), in which we seek to approximate an integral based on noisy black-box queries to the underlying function. We consider functions in a {\em Reproducing Kernel Hilbert Space} (RKHS) with the Mat\'ern-$\nu$ kernel, focusing on combinations of the parameter $\nu$ and dimension $d$ such that the RKHS is equivalent to a Sobolev class. In this setting, we provide near-matching upper and lower bounds on the best possible average error. Specifically, we find that when the black-box queries are subject to Gaussian noise having variance $\sigma^2$, any algorithm making at most $T$ queries (even with adaptive sampling) must incur a mean absolute error of $\Omega(T^{-\frac{\nu}{d}-1} + \sigma T^{-\frac{1}{2}})$, and there exists a non-adaptive algorithm attaining an error of at most $O(T^{-\frac{\nu}{d}-1} + \sigma T^{-\frac{1}{2}})$. Hence, the bounds are order-optimal, and establish that there is no adaptivity gap in terms of scaling laws.
    Deep Iterative Phase Retrieval for Ptychography. (arXiv:2202.10573v1 [eess.IV])
    One of the most prominent challenges in the field of diffractive imaging is the phase retrieval (PR) problem: In order to reconstruct an object from its diffraction pattern, the inverse Fourier transform must be computed. This is only possible given the full complex-valued diffraction data, i.e. magnitude and phase. However, in diffractive imaging, generally only magnitudes can be directly measured while the phase needs to be estimated. In this work we specifically consider ptychography, a sub-field of diffractive imaging, where objects are reconstructed from multiple overlapping diffraction images. We propose an augmentation of existing iterative phase retrieval algorithms with a neural network designed for refining the result of each iteration. For this purpose we adapt and extend a recently proposed architecture from the speech processing field. Evaluation results show the proposed approach delivers improved convergence rates in terms of both iteration count and algorithm runtime.
    Dynamic Relation Discovery and Utilization in Multi-Entity Time Series Forecasting. (arXiv:2202.10586v1 [cs.LG])
    Time series forecasting plays a key role in a variety of domains. In a lot of real-world scenarios, there exist multiple forecasting entities (e.g. power station in the solar system, stations in the traffic system). A straightforward forecasting solution is to mine the temporal dependency for each individual entity through 1d-CNN, RNN, transformer, etc. This approach overlooks the relations between these entities and, in consequence, loses the opportunity to improve performance using spatial-temporal relation. However, in many real-world scenarios, beside explicit relation, there could exist crucial yet implicit relation between entities. How to discover the useful implicit relation between entities and effectively utilize the relations for each entity under various circumstances is crucial. In order to mine the implicit relation between entities as much as possible and dynamically utilize the relation to improve the forecasting performance, we propose an attentional multi-graph neural network with automatic graph learning (A2GNN) in this work. Particularly, a Gumbel-softmax based auto graph learner is designed to automatically capture the implicit relation among forecasting entities. We further propose an attentional relation learner that enables every entity to dynamically pay attention to its preferred relations. Extensive experiments are conducted on five real-world datasets from three different domains. The results demonstrate the effectiveness of A2GNN beyond several state-of-the-art methods.
    Predicting emotion from music videos: exploring the relative contribution of visual and auditory information to affective responses. (arXiv:2202.10453v1 [cs.CV])
    Although media content is increasingly produced, distributed, and consumed in multiple combinations of modalities, how individual modalities contribute to the perceived emotion of a media item remains poorly understood. In this paper we present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis to study how the auditory and visual modalities contribute to the perceived emotion of media. The data were collected by presenting music videos to participants in three conditions: music, visual, and audiovisual. Participants annotated the music videos for valence and arousal over time, as well as the overall emotion conveyed. We present detailed descriptive statistics for key measures in the dataset and the results of feature importance analyses for each condition. Finally, we propose a novel transfer learning architecture to train Predictive models Augmented with Isolated modality Ratings (PAIR) and demonstrate the potential of isolated modality ratings for enhancing multimodal emotion recognition. Our results suggest that perceptions of arousal are influenced primarily by auditory information, while perceptions of valence are more subjective and can be influenced by both visual and auditory information. The dataset is made publicly available.
    Debiasing Backdoor Attack: A Benign Application of Backdoor Attack in Eliminating Data Bias. (arXiv:2202.10582v1 [cs.CR])
    Backdoor attack is a new AI security risk that has emerged in recent years. Drawing on the previous research of adversarial attack, we argue that the backdoor attack has the potential to tap into the model learning process and improve model performance. Based on Clean Accuracy Drop (CAD) in backdoor attack, we found that CAD came out of the effect of pseudo-deletion of data. We provided a preliminary explanation of this phenomenon from the perspective of model classification boundaries and observed that this pseudo-deletion had advantages over direct deletion in the data debiasing problem. Based on the above findings, we proposed Debiasing Backdoor Attack (DBA). It achieves SOTA in the debiasing task and has a broader application scenario than undersampling.
    Towards technological adaptation of advanced farming through AI, IoT, and Robotics: A Comprehensive overview. (arXiv:2202.10459v1 [cs.AI])
    The population explosion of the 21st century has adversely affected the natural resources with restricted availability of cultivable land, increased average temperatures due to global warming, and carbon footprint resulting in a drastic increase in floods as well as droughts thus making food security significant anxiety for most countries. The traditional methods were no longer sufficient which paved the way for technological ascents such as a substantial rise in Artificial Intelligence (AI), Internet of Things (IoT), as well as Robotics that provides high productivity, functional efficiency, flexibility, cost-effectiveness in the domain of agriculture. AI, IoT, and Robotics-based devices and methods have produced new paradigms and opportunities in agriculture. AI's existing approaches are soil management, crop diseases identification, weed identification, and management in collaboration with IoT devices. IoT has utilized automatic agricultural operations and real-time monitoring with few personnel employed in real-time. The major existing applications of agricultural robotics are for the function of soil preparation, planting, monitoring, harvesting, and storage. In this paper, researchers have explored a comprehensive overview of recent implementation, scopes, opportunities, challenges, limitations, and future research instructions of AI, IoT, and Robotics based methodology in the agriculture sector.
    A Novel Architecture Slimming Method for Network Pruning and Knowledge Distillation. (arXiv:2202.10461v1 [cs.CV])
    Network pruning and knowledge distillation are two widely-known model compression methods that efficiently reduce computation cost and model size. A common problem in both pruning and distillation is to determine compressed architecture, i.e., the exact number of filters per layer and layer configuration, in order to preserve most of the original model capacity. In spite of the great advances in existing works, the determination of an excellent architecture still requires human interference or tremendous experimentations. In this paper, we propose an architecture slimming method that automates the layer configuration process. We start from the perspective that the capacity of the over-parameterized model can be largely preserved by finding the minimum number of filters preserving the maximum parameter variance per layer, resulting in a thin architecture. We formulate the determination of compressed architecture as a one-step orthogonal linear transformation, and integrate principle component analysis (PCA), where the variances of filters in the first several projections are maximized. We demonstrate the rationality of our analysis and the effectiveness of the proposed method through extensive experiments. In particular, we show that under the same overall compression rate, the compressed architecture determined by our method shows significant performance gain over baselines after pruning and distillation. Surprisingly, we find that the resulting layer-wise compression rates correspond to the layer sensitivities found by existing works through tremendous experimentations.
    Open-Set Support Vector Machines. (arXiv:1606.03802v11 [cs.LG] UPDATED)
    Often, when dealing with real-world recognition problems, we do not need, and often cannot have, knowledge of the entire set of possible classes that might appear during operational testing. In such cases, we need to think of robust classification methods able to deal with the "unknown" and properly reject samples belonging to classes never seen during training. Notwithstanding, existing classifiers to date were mostly developed for the closed-set scenario, i.e., the classification setup in which it is assumed that all test samples belong to one of the classes with which the classifier was trained. In the open-set scenario, however, a test sample can belong to none of the known classes and the classifier must properly reject it by classifying it as unknown. In this work, we extend upon the well-known Support Vector Machines (SVM) classifier and introduce the Open-Set Support Vector Machines (OSSVM), which is suitable for recognition in open-set setups. OSSVM balances the empirical risk and the risk of the unknown and ensures that the region of the feature space in which a test sample would be classified as known (one of the known classes) is always bounded, ensuring a finite risk of the unknown. In this work, we also highlight the properties of the SVM classifier related to the open-set scenario, and provide necessary and sufficient conditions for an RBF SVM to have bounded open-space risk.
    Moment Matching Deep Contrastive Latent Variable Models. (arXiv:2202.10560v1 [cs.LG])
    In the contrastive analysis (CA) setting, machine learning practitioners are specifically interested in discovering patterns that are enriched in a target dataset as compared to a background dataset generated from sources of variation irrelevant to the task at hand. For example, a biomedical data analyst may seek to understand variations in genomic data only present among patients with a given disease as opposed to those also present in healthy control subjects. Such scenarios have motivated the development of contrastive latent variable models to isolate variations unique to these target datasets from those shared across the target and background datasets, with current state of the art models based on the variational autoencoder (VAE) framework. However, previously proposed models do not explicitly enforce the constraints on latent variables underlying CA, potentially leading to the undesirable leakage of information between the two sets of latent variables. Here we propose the moment matching contrastive VAE (MM-cVAE), a reformulation of the VAE for CA that uses the maximum mean discrepancy to explicitly enforce two crucial latent variable constraints underlying CA. On three challenging CA tasks we find that our method outperforms the previous state-of-the-art both qualitatively and on a set of quantitative metrics.
    A Classical-Quantum Convolutional Neural Network for Detecting Pneumonia from Chest Radiographs. (arXiv:2202.10452v1 [cs.CV])
    While many quantum computing techniques for machine learning have been proposed, their performance on real-world datasets remains to be studied. In this paper, we explore how a variational quantum circuit could be integrated into a classical neural network for the problem of detecting pneumonia from chest radiographs. We substitute one layer of a classical convolutional neural network with a variational quantum circuit to create a hybrid neural network. We train both networks on an image dataset containing chest radiographs and benchmark their performance. To mitigate the influence of different sources of randomness in network training, we sample the results over multiple rounds. We show that the hybrid network outperforms the classical network on different performance measures, and that these improvements are statistically significant. Our work serves as an experimental demonstration of the potential of quantum computing to significantly improve neural network performance for real-world, non-trivial problems relevant to society and industry.
    Robust and Provable Guarantees for Sparse Random Embeddings. (arXiv:2202.10815v1 [cs.LG])
    In this work, we improve upon the guarantees for sparse random embeddings, as they were recently provided and analyzed by Freksen at al. (NIPS'18) and Jagadeesan (NIPS'19). Specifically, we show that (a) our bounds are explicit as opposed to the asymptotic guarantees provided previously, and (b) our bounds are guaranteed to be sharper by practically significant constants across a wide range of parameters, including the dimensionality, sparsity and dispersion of the data. Moreover, we empirically demonstrate that our bounds significantly outperform prior works on a wide range of real-world datasets, such as collections of images, text documents represented as bags-of-words, and text sequences vectorized by neural embeddings. Behind our numerical improvements are techniques of broader interest, which improve upon key steps of previous analyses in terms of (c) tighter estimates for certain types of quadratic chaos, (d) establishing extreme properties of sparse linear forms, and (e) improvements on bounds for the estimation of sums of independent random variables.
    Single-Leg Revenue Management with Advice. (arXiv:2202.10939v1 [cs.GT])
    Single-leg revenue management is a foundational problem of revenue management that has been particularly impactful in the airline and hotel industry: Given $n$ units of a resource, e.g. flight seats, and a stream of sequentially-arriving customers segmented by fares, what is the optimal online policy for allocating the resource. Previous work focused on designing algorithms when forecasts are available, which are not robust to inaccuracies in the forecast, or online algorithms with worst-case performance guarantees, which can be too conservative in practice. In this work, we look at the single-leg revenue management problem through the lens of the algorithms-with-advice framework, which attempts to optimally incorporate advice/predictions about the future into online algorithms. In particular, we characterize the Pareto frontier that captures the tradeoff between consistency (performance when advice is accurate) and competitiveness (performance when advice is inaccurate) for every advice. Moreover, we provide an online algorithm that always achieves performance on this Pareto frontier. We also study the class of protection level policies, which is the most widely-deployed technique for single-leg revenue management: we provide an algorithm to incorporate advice into protection levels that optimally trades off consistency and competitiveness. Moreover, we empirically evaluate the performance of these algorithms on synthetic data. We find that our algorithm for protection level policies performs remarkably well on most instances, even if it is not guaranteed to be on the Pareto frontier in theory.
    An adaptive neuro-fuzzy model for attitude estimation and 2 control a 3 DOF system. (arXiv:2004.09756v3 [eess.SY] UPDATED)
    In recent decades, one of the scientists' main concerns has been to improve the accuracy of satellite attitude, regardless of the expense. The obvious result is that a large number of control strategies have been used to address this problem. In this study, an adaptive neuro-fuzzy integrated (ANFIS) satellite attitude estimation and control system was developed. The controller is trained with the data provided by an optimal controller. A pulse modulator is used to generate the right ON/OFF commands of the thruster actuator. To evaluate the performance of the AN-FIS controller in closed-loop simulation, an ANFIS observer is used to estimate the attitude and angular velocities of the satellite using magnetometer, sun sensor and data gyro data. In addition, a new ANFIS system will be proposed and evaluated that can jointly control and estimate the system. The performance of the ANFIS controller is compared to the optimal PID controller in a Monte Carlo simulation with different initial conditions, disturbance and noise. The results show that the ANFIS controller can surpass the optimal PID controller in several aspects, including time and smoothness. In addition, the ANFIS estimator is examined and the results demonstrate the high ability of this designated observers. Both the control and estimation phases are simulated by a single ANFIS subsystem, taking into account the high capacity of ANFIS, and the results of using the ANFIS model are demonstrated.
    Black-box Certification and Learning under Adversarial Perturbations. (arXiv:2006.16520v2 [stat.ML] UPDATED)
    We formally study the problem of classification under adversarial perturbations from a learner's perspective as well as a third-party who aims at certifying the robustness of a given black-box classifier. We analyze a PAC-type framework of semi-supervised learning and identify possibility and impossibility results for proper learning of VC-classes in this setting. We further introduce a new setting of black-box certification under limited query budget, and analyze this for various classes of predictors and perturbation. We also consider the viewpoint of a black-box adversary that aims at finding adversarial examples, showing that the existence of an adversary with polynomial query complexity can imply the existence of a sample efficient robust learner.
    A Graph-based U-Net Model for Predicting Traffic in unseen Cities. (arXiv:2202.06725v2 [cs.LG] UPDATED)
    Accurate traffic prediction is a key ingredient to enable traffic management like rerouting cars to reduce road congestion or regulating traffic via dynamic speed limits to maintain a steady flow. A way to represent traffic data is in the form of temporally changing heatmaps visualizing attributes of traffic, such as speed and volume. In recent works, U-Net models have shown SOTA performance on traffic forecasting from heatmaps. We propose to combine the U-Net architecture with graph layers which improves spatial generalization to unseen road networks compared to a Vanilla U-Net. In particular, we specialize existing graph operations to be sensitive to geographical topology and generalize pooling and upsampling operations to be applicable to graphs.
    Increasing Depth of Neural Networks for Life-long Learning. (arXiv:2202.10821v1 [cs.LG])
    Increasing neural network depth is a well-known method for improving neural network performance. Modern deep architectures contain multiple mechanisms that allow hundreds or even thousands of layers to train. This work is trying to answer if extending neural network depth may be beneficial in a life-long learning setting. In particular, we propose a novel method based on adding new layers on top of existing ones to enable the forward transfer of knowledge and adapting previously learned representations for new tasks. We utilize a method of determining the most similar tasks for selecting the best location in our network to add new nodes with trainable parameters. This approach allows for creating a tree-like model, where each node is a set of neural network parameters dedicated to a specific task. The proposed method is inspired by Progressive Neural Network (PNN) concept, therefore it is rehearsal-free and benefits from dynamic change of network structure. However, it requires fewer parameters per task than PNN. Experiments on Permuted MNIST and SplitCIFAR show that the proposed algorithm is on par with other continual learning methods. We also perform ablation studies to clarify the contributions of each system part.
    Statistical and Spatio-temporal Hand Gesture Features for Sign Language Recognition using the Leap Motion Sensor. (arXiv:2202.11005v1 [cs.CV])
    In modern society, people should not be identified based on their disability, rather, it is environments that can disable people with impairments. Improvements to automatic Sign Language Recognition (SLR) will lead to more enabling environments via digital technology. Many state-of-the-art approaches to SLR focus on the classification of static hand gestures, but communication is a temporal activity, which is reflected by many of the dynamic gestures present. Given this, temporal information during the delivery of a gesture is not often considered within SLR. The experiments in this work consider the problem of SL gesture recognition regarding how dynamic gestures change during their delivery, and this study aims to explore how single types of features as well as mixed features affect the classification ability of a machine learning model. 18 common gestures recorded via a Leap Motion Controller sensor provide a complex classification problem. Two sets of features are extracted from a 0.6 second time window, statistical descriptors and spatio-temporal attributes. Features from each set are compared by their ANOVA F-Scores and p-values, arranged into bins grown by 10 features per step to a limit of the 250 highest-ranked features. Results show that the best statistical model selected 240 features and scored 85.96% accuracy, the best spatio-temporal model selected 230 features and scored 80.98%, and the best mixed-feature model selected 240 features from each set leading to a classification accuracy of 86.75%. When all three sets of results are compared (146 individual machine learning models), the overall distribution shows that the minimum results are increased when inputs are any number of mixed features compared to any number of either of the two single sets of features.
    Incentive Mechanism Design for Joint Resource Allocation in Blockchain-based Federated Learning. (arXiv:2202.10938v1 [cs.CR])
    Blockchain-based federated learning (BCFL) has recently gained tremendous attention because of its advantages such as decentralization and privacy protection of raw data. However, there has been few research focusing on the allocation of resources for clients in BCFL. In the BCFL framework where the FL clients and the blockchain miners are the same devices, clients broadcast the trained model updates to the blockchain network and then perform mining to generate new blocks. Since each client has a limited amount of computing resources, the problem of allocating computing resources into training and mining needs to be carefully addressed. In this paper, we design an incentive mechanism to assign each client appropriate rewards for training and mining, and then the client will determine the amount of computing power to allocate for each subtask based on these rewards using the two-stage Stackelberg game. After analyzing the utilities of the model owner (MO) (i.e., the BCFL task publisher) and clients, we transform the game model into two optimization problems, which are sequentially solved to derive the optimal strategies for both the MO and clients. Further, considering the fact that local training related information of each client may not be known by others, we extend the game model with analytical solutions to the incomplete information scenario. Extensive experimental results demonstrate the validity of our proposed schemes.
    Differentially Private Estimation of Heterogeneous Causal Effects. (arXiv:2202.11043v1 [stat.ML])
    Estimating heterogeneous treatment effects in domains such as healthcare or social science often involves sensitive data where protecting privacy is important. We introduce a general meta-algorithm for estimating conditional average treatment effects (CATE) with differential privacy (DP) guarantees. Our meta-algorithm can work with simple, single-stage CATE estimators such as S-learner and more complex multi-stage estimators such as DR and R-learner. We perform a tight privacy analysis by taking advantage of sample splitting in our meta-algorithm and the parallel composition property of differential privacy. In this paper, we implement our approach using DP-EBMs as the base learner. DP-EBMs are interpretable, high-accuracy models with privacy guarantees, which allow us to directly observe the impact of DP noise on the learned causal model. Our experiments show that multi-stage CATE estimators incur larger accuracy loss than single-stage CATE or ATE estimators and that most of the accuracy loss from differential privacy is due to an increase in variance, not biased estimates of treatment effects.
    A Clustering Preserving Transformation for k-Means Algorithm Output. (arXiv:2202.10455v1 [cs.LG])
    This note introduces a novel clustering preserving transformation of cluster sets obtained from $k$-means algorithm. This transformation may be used to generate new labeled data{}sets from existent ones. It is more flexible that Kleinberg axiom based consistency transformation because data points in a cluster can be moved away and datapoints between clusters may come closer together.
    A Review of Affective Generation Models. (arXiv:2202.10763v1 [cs.LG])
    Affective computing is an emerging interdisciplinary field where computational systems are developed to analyze, recognize, and influence the affective states of a human. It can generally be divided into two subproblems: affective recognition and affective generation. Affective recognition has been extensively reviewed multiple times in the past decade. Affective generation, however, lacks a critical review. Therefore, we propose to provide a comprehensive review of affective generation models, as models are most commonly leveraged to affect others' emotional states. Affective computing has gained momentum in various fields and applications, thanks to the leap of machine learning, especially deep learning since 2015. With critical models introduced, this work is believed to benefit future research on affective generation. We conclude this work with a brief discussion on existing challenges.
    Adaptive Cholesky Gaussian Processes. (arXiv:2202.10769v1 [cs.LG])
    We present a method to fit exact Gaussian process models to large datasets by considering only a subset of the data. Our approach is novel in that the size of the subset is selected on the fly during exact inference with little computational overhead. From an empirical observation that the log-marginal likelihood often exhibits a linear trend once a sufficient subset of a dataset has been observed, we conclude that many large datasets contain redundant information that only slightly affects the posterior. Based on this, we provide probabilistic bounds on the full model evidence that can identify such subsets. Remarkably, these bounds are largely composed of terms that appear in intermediate steps of the standard Cholesky decomposition, allowing us to modify the algorithm to adaptively stop the decomposition once enough data have been observed. Empirically, we show that our method can be directly plugged into well-known inference schemes to fit exact Gaussian process models to large datasets.
    A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets. (arXiv:2202.10574v1 [stat.ML])
    The two-sided markets such as ride-sharing companies often involve a group of subjects who are making sequential decisions across time and/or location. With the rapid development of smart phones and internet of things, they have substantially transformed the transportation landscape of human beings. In this paper we consider large-scale fleet management in ride-sharing companies that involve multiple units in different areas receiving sequences of products (or treatments) over time. Major technical challenges, such as policy evaluation, arise in those studies because (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. To address both challenges simultaneously, we introduce a multi-agent reinforcement learning (MARL) framework for carrying policy evaluation in these studies. We propose novel estimators for mean outcomes under different products that are consistent despite the high-dimensionality of state-action space. The proposed estimator works favorably in simulation experiments. We further illustrate our method using a real dataset obtained from a two-sided marketplace company to evaluate the effects of applying different subsidizing policies. A Python implementation of the proposed method is available at https://github.com/RunzheStat/CausalMARL.
    Variational Neural Temporal Point Process. (arXiv:2202.10585v1 [cs.LG])
    A temporal point process is a stochastic process that predicts which type of events is likely to happen and when the event will occur given a history of a sequence of events. There are various examples of occurrence dynamics in the daily life, and it is important to train the temporal dynamics and solve two different prediction problems, time and type predictions. Especially, deep neural network based models have outperformed the statistical models, such as Hawkes processes and Poisson processes. However, many existing approaches overfit to specific events, instead of learning and predicting various event types. Therefore, such approaches could not cope with the modified relationships between events and fail to predict the intensity functions of temporal point processes very well. In this paper, to solve these problems, we propose a variational neural temporal point process (VNTPP). We introduce the inference and the generative networks, and train a distribution of latent variable to deal with stochastic property on deep neural network. The intensity functions are computed using the distribution of latent variable so that we can predict event types and the arrival times of the events more accurately. We empirically demonstrate that our model can generalize the representations of various event types. Moreover, we show quantitatively and qualitatively that our model outperforms other deep neural network based models and statistical processes on synthetic and real-world datasets.
    Equivariant Graph Hierarchy-Based Neural Networks. (arXiv:2202.10643v1 [cs.LG])
    Equivariant Graph neural Networks (EGNs) are powerful in characterizing the dynamics of multi-body physical systems. Existing EGNs conduct flat message passing, which, yet, is unable to capture the spatial/dynamical hierarchy for complex systems particularly, limiting substructure discovery and global information fusion. In this paper, we propose Equivariant Hierarchy-based Graph Networks (EGHNs) which consist of the three key components: generalized Equivariant Matrix Message Passing (EMMP) , E-Pool and E-UpPool. In particular, EMMP is able to improve the expressivity of conventional equivariant message passing, E-Pool assigns the quantities of the low-level nodes into high-level clusters, while E-UpPool leverages the high-level information to update the dynamics of the low-level nodes. As their names imply, both E-Pool and E-UpPool are guaranteed to be equivariant to meet physic symmetry. Considerable experimental evaluations verify the effectiveness of our EGHN on several applications including multi-object dynamics simulation, motion capture, and protein dynamics modeling.
    Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning. (arXiv:2202.10678v1 [cs.AI])
    In today's economy, it becomes important for Internet platforms to consider the sequential information design problem to align its long term interest with incentives of the gig service providers. This paper proposes a novel model of sequential information design, namely the Markov persuasion processes (MPPs), where a sender, with informational advantage, seeks to persuade a stream of myopic receivers to take actions that maximizes the sender's cumulative utilities in a finite horizon Markovian environment with varying prior and utility functions. Planning in MPPs thus faces the unique challenge in finding a signaling policy that is simultaneously persuasive to the myopic receivers and inducing the optimal long-term cumulative utilities of the sender. Nevertheless, in the population level where the model is known, it turns out that we can efficiently determine the optimal (resp. $\epsilon$-optimal) policy with finite (resp. infinite) states and outcomes, through a modified formulation of the Bellman equation. Our main technical contribution is to study the MPP under the online reinforcement learning (RL) setting, where the goal is to learn the optimal signaling policy by interacting with with the underlying MPP, without the knowledge of the sender's utility functions, prior distributions, and the Markov transition kernels. We design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both optimism and pessimism principles. Our algorithm enjoys sample efficiency by achieving a sublinear $\sqrt{T}$-regret upper bound. Furthermore, both our algorithm and theory can be applied to MPPs with large space of outcomes and states via function approximation, and we showcase such a success under the linear setting.
    Reward-Free Policy Space Compression for Reinforcement Learning. (arXiv:2202.11079v1 [cs.LG])
    In reinforcement learning, we encode the potential behaviors of an agent interacting with an environment into an infinite set of policies, the policy space, typically represented by a family of parametric functions. Dealing with such a policy space is a hefty challenge, which often causes sample and computation inefficiencies. However, we argue that a limited number of policies are actually relevant when we also account for the structure of the environment and of the policy parameterization, as many of them would induce very similar interactions, i.e., state-action distributions. In this paper, we seek for a reward-free compression of the policy space into a finite set of representative policies, such that, given any policy $\pi$, the minimum R\'enyi divergence between the state-action distributions of the representative policies and the state-action distribution of $\pi$ is bounded. We show that this compression of the policy space can be formulated as a set cover problem, and it is inherently NP-hard. Nonetheless, we propose a game-theoretic reformulation for which a locally optimal solution can be efficiently found by iteratively stretching the compressed space to cover an adversarial policy. Finally, we provide an empirical evaluation to illustrate the compression procedure in simple domains, and its ripple effects in reinforcement learning.
    Accelerating Primal-dual Methods for Regularized Markov Decision Processes. (arXiv:2202.10506v1 [math.OC])
    Entropy regularized Markov decision processes have been widely used in reinforcement learning. This paper is concerned with the primal-dual formulation of the entropy regularized problems. Standard first-order methods suffer from slow convergence due to the lack of strict convexity and concavity. To address this issue, we first introduce a new quadratically convexified primal-dual formulation. The natural gradient ascent descent of the new formulation enjoys global convergence guarantee and exponential convergence rate. We also propose a new interpolating metric that further accelerates the convergence significantly. Numerical results are provided to demonstrate the performance of the proposed methods under multiple settings.
    On Uncertainty Estimation by Tree-based Surrogate Models in Sequential Model-based Optimization. (arXiv:2202.10669v1 [stat.ML])
    Sequential model-based optimization sequentially selects a candidate point by constructing a surrogate model with the history of evaluations, to solve a black-box optimization problem. Gaussian process (GP) regression is a popular choice as a surrogate model, because of its capability of calculating prediction uncertainty analytically. On the other hand, an ensemble of randomized trees is another option and has practical merits over GPs due to its scalability and easiness of handling continuous/discrete mixed variables. In this paper we revisit various ensembles of randomized trees to investigate their behavior in the perspective of prediction uncertainty estimation. Then, we propose a new way of constructing an ensemble of randomized trees, referred to as BwO forest, where bagging with oversampling is employed to construct bootstrapped samples that are used to build randomized trees with random splitting. Experimental results demonstrate the validity and good performance of BwO forest over existing tree-based models in various circumstances.
    Unleashing the Power of Transformer for Graphs. (arXiv:2202.10581v1 [cs.LG])
    Despite recent successes in natural language processing and computer vision, Transformer suffers from the scalability problem when dealing with graphs. The computational complexity is unacceptable for large-scale graphs, e.g., knowledge graphs. One solution is to consider only the near neighbors, which, however, will lose the key merit of Transformer to attend to the elements at any distance. In this paper, we propose a new Transformer architecture, named dual-encoding Transformer (DET). DET has a structural encoder to aggregate information from connected neighbors and a semantic encoder to focus on semantically useful distant nodes. In comparison with resorting to multi-hop neighbors, DET seeks the desired distant neighbors via self-supervised training. We further find these two encoders can be incorporated to boost each others' performance. Our experiments demonstrate DET has achieved superior performance compared to the respective state-of-the-art methods in dealing with molecules, networks and knowledge graphs with various sizes.
    ABAW: Valence-Arousal Estimation, Expression Recognition, Action Unit Detection & Multi-Task Learning Challenges. (arXiv:2202.10659v1 [cs.CV])
    This paper describes the third Affective Behavior Analysis in-the-wild (ABAW) Competition, held in conjunction with IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2022. The 3rd ABAW Competition is a continuation of the Competitions held at ICCV 2021, IEEE FG 2020 and IEEE CVPR 2017 Conferences, and aims at automatically analyzing affect. This year the Competition encompasses four Challenges: i) uni-task Valence-Arousal Estimation, ii) uni-task Expression Classification, iii) uni-task Action Unit Detection, and iv) Multi-Task-Learning. All the Challenges are based on a common benchmark database, Aff-Wild2, which is a large scale in-the-wild database and the first one to be annotated in terms of valence-arousal, expressions and action units. In this paper, we present the four Challenges, with the utilized Competition corpora, we outline the evaluation metrics and present the baseline systems along with their obtained results.  ( 2 min )
    Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning. (arXiv:2202.10629v1 [cs.LG])
    In data-rich domains such as vision, language, and speech, deep learning prevails to deliver high-performance task-specific models and can even learn general task-agnostic representations for efficient finetuning to downstream tasks. However, deep learning in resource-limited domains still faces the following challenges including (i) limited data, (ii) constrained model development cost, and (iii) lack of adequate pre-trained models for effective finetuning. This paper introduces a new technique called model reprogramming to bridge this gap. Model reprogramming enables resource-efficient cross-domain machine learning by repurposing and reusing a well-developed pre-trained model from a source domain to solve tasks in a target domain without model finetuning, where the source and target domains can be vastly different. In many applications, model reprogramming outperforms transfer learning and training from scratch. This paper elucidates the methodology of model reprogramming, summarizes existing use cases, provides a theoretical explanation on the success of model reprogramming, and concludes with a discussion on open-ended research questions and opportunities. A list of model reprogramming studies is actively maintained and updated at https://github.com/IBM/model-reprogramming.  ( 2 min )
    CD-ROM: Complementary Deep-Reduced Order Model. (arXiv:2202.10746v1 [physics.flu-dyn])
    Model order reduction through the POD-Galerkin method can lead to dramatic gains in terms of computational efficiency in solving physical problems. However, the applicability of the method to non linear high-dimensional dynamical systems such as the Navier-Stokes equations has been shown to be limited, producing inaccurate and sometimes unstable models. This paper proposes a closure modeling approach for classical POD-Galerkin reduced order models (ROM). We use multi layer perceptrons (MLP) to learn a continuous in time closure model through the recently proposed Neural ODE method. Inspired by Taken's theorem as well as the Mori-Zwanzig formalism, we augment ROMs with a delay differential equation architecture to model non-Markovian effects in reduced models. The proposed model, called CD-ROM (Complementary Deep-Reduced Order Model) is able to retain information from past states of the system and use it to correct the imperfect reduced dynamics. The model can be integrated in time as a system of ordinary differential equations using any classical time marching scheme. We demonstrate the ability of our CD-ROM approach to improve the accuracy of POD-Galerkin models on two CFD examples, even in configurations unseen during training.  ( 2 min )
    Privacy Leakage of Adversarial Training Models in Federated Learning Systems. (arXiv:2202.10546v1 [cs.LG])
    Adversarial Training (AT) is crucial for obtaining deep neural networks that are robust to adversarial attacks, yet recent works found that it could also make models more vulnerable to privacy attacks. In this work, we further reveal this unsettling property of AT by designing a novel privacy attack that is practically applicable to the privacy-sensitive Federated Learning (FL) systems. Using our method, the attacker can exploit AT models in the FL system to accurately reconstruct users' private training images even when the training batch size is large. Code is available at https://github.com/zjysteven/PrivayAttack_AT_FL.
    Imbalanced Classification via Explicit Gradient Learning From Augmented Data. (arXiv:2202.10550v1 [cs.LG])
    Learning from imbalanced data is one of the most significant challenges in real-world classification tasks. In such cases, neural networks performance is substantially impaired due to preference towards the majority class. Existing approaches attempt to eliminate the bias through data re-sampling or re-weighting the loss in the learning process. Still, these methods tend to overfit the minority samples and perform poorly when the structure of the minority class is highly irregular. Here, we propose a novel deep meta-learning technique to augment a given imbalanced dataset with new minority instances. These additional data are incorporated in the classifier's deep-learning process, and their contributions are learned explicitly. The advantage of the proposed method is demonstrated on synthetic and real-world datasets with various imbalance ratios.
    Ligandformer: A Graph Neural Network for Predicting Ligand Property with Robust Interpretation. (arXiv:2202.10873v1 [q-bio.BM])
    Robust and efficient interpretation of QSAR methods is quite useful to validate AI prediction rationales with subjective opinion (chemist or biologist expertise), understand sophisticated chemical or biological process mechanisms, and provide heuristic ideas for structure optimization in pharmaceutical industry. For this purpose, we construct a multi-layer self-attention based Graph Neural Network framework, namely Ligandformer, for predicting ligand property with interpretation. Ligandformer integrates attention maps on ligand structure from different network blocks. The integrated attention map reflects the machine's local interest on compound structure, and indicates the relationship between predicted compound property and its structure. This work mainly contributes to three aspects: 1. Ligandformer directly opens the black-box of deep learning methods, providing local prediction rationales on chemical structures. 2. Ligandformer gives robust prediction in different experimental rounds, overcoming the ubiquitous prediction instability of deep learning methods. 3. Ligandformer can be generalized to predict different chemical or biological properties with high performance. Furthermore, Ligandformer can simultaneously output specific property score and visible attention map on structure, which can support researchers to investigate chemical or biological property and optimize structure efficiently. Our framework outperforms over counterparts in terms of accuracy, robustness and generalization, and can be applied in complex system study.  ( 2 min )
    Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations. (arXiv:2202.10638v1 [stat.ML])
    Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the correct data augmentation is chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation. Our approach relies on phrasing data augmentation as an invariance in the prior distribution and learning it using Bayesian model selection, which has been shown to work in Gaussian processes, but not yet for deep neural networks. We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation on image datasets.  ( 2 min )
  • Open

    Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits. (arXiv:2106.14866v2 [stat.ML] UPDATED)
    We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance from observing the learning process of a low-regret demonstrator. Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy, and thereby suffer from an identifiability issue. In contrast, we propose to leverage the demonstrator's behavior en route to optimality, and in particular, the exploration phase, for reward estimation. We begin by establishing a general information-theoretic lower bound under this paradigm that applies to any demonstrator algorithm, which characterizes a fundamental tradeoff between reward estimation and the amount of exploration of the demonstrator. Then, we develop simple and efficient reward estimators for upper-confidence-based demonstrator algorithms that attain the optimal tradeoff, showing in particular that consistent reward estimation -- free of identifiability issues -- is possible under our paradigm. Extensive simulations on both synthetic and semi-synthetic data corroborate our theoretical results.
    Permutation-equivariant and Proximity-aware Graph Neural Networks with Stochastic Message Passing. (arXiv:2009.02562v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) are emerging machine learning models on graphs. Permutation-equivariance and proximity-awareness are two important properties highly desirable for GNNs. Both properties are needed to tackle some challenging graph problems, such as finding communities and leaders. In this paper, we first analytically show that the existing GNNs, mostly based on the message-passing mechanism, cannot simultaneously preserve the two properties. Then, we propose Stochastic Message Passing (SMP) model, a general and simple GNN to maintain both proximity-awareness and permutation-equivariance. In order to preserve node proximities, we augment the existing GNNs with stochastic node representations. We theoretically prove that the mechanism can enable GNNs to preserve node proximities, and at the same time, maintain permutation-equivariance with certain parametrization. We report extensive experimental results on ten datasets and demonstrate the effectiveness and efficiency of SMP for various typical graph mining tasks, including graph reconstruction, node classification, and link prediction.
    Causal inference for observational longitudinal studies using deep survival models. (arXiv:2101.10643v11 [stat.ML] UPDATED)
    Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history. To tackle this longitudinal treatment effect estimation problem, we have developed a causal dynamic survival (CDS) model that uses the potential outcomes framework with an ensemble of recurrent subnetworks to estimate the difference in survival probabilities and its confidence interval over time as a function of past, time-dependent covariates and treatments. Using simulated survival datasets, the CDS model showed good causal effect estimation performance across scenarios of varying sample dimensions, event rates, confounding and overlapping. However, increasing the sample size was not effective in alleviating the adverse impact of a high level of confounding. In two large clinical cohort studies, CDS identified the expected conditional average treatment effect and detected individual treatment effect heterogeneity over time. CDS provides an efficient way to estimate and update individualized treatment effects over time, in order to improve clinical decisions.
    Gaussian Imagination in Bandit Learning. (arXiv:2201.01902v3 [cs.LG] UPDATED)
    Assuming distributions are Gaussian often facilitates computations that are otherwise intractable. We study the performance of an agent that attains a bounded information ratio with respect to a bandit environment with a Gaussian prior distribution and a Gaussian likelihood function when applied instead to a Bernoulli bandit. Relative to an information-theoretic bound on the Bayesian regret the agent would incur when interacting with the Gaussian bandit, we bound the increase in regret when the agent interacts with the Bernoulli bandit. If the Gaussian prior distribution and likelihood function are sufficiently diffuse, this increase grows at a rate which is at most linear in the square-root of the time horizon, and thus the per-timestep increase vanishes. Our results formalize the folklore that so-called Bayesian agents remain effective when instantiated with diffuse misspecified distributions.
    Harmless interpolation in regression and classification with structured features. (arXiv:2111.05198v2 [stat.ML] UPDATED)
    Overparametrized neural networks tend to perfectly fit noisy training data yet generalize well on test data. Inspired by this empirical observation, recent work has sought to understand this phenomenon of benign overfitting or harmless interpolation in the much simpler linear model. Previous theoretical work critically assumes that either the data features are statistically independent or the input data is high-dimensional; this precludes general nonparametric settings with structured feature maps. In this paper, we present a general and flexible framework for upper bounding regression and classification risk in a reproducing kernel Hilbert space. A key contribution is that our framework describes precise sufficient conditions on the data Gram matrix under which harmless interpolation occurs. Our results recover prior independent-features results (with a much simpler analysis), but they furthermore show that harmless interpolation can occur in more general settings such as features that are a bounded orthonormal system. Furthermore, our results show an asymptotic separation between classification and regression performance in a manner that was previously only shown for Gaussian features.
    Continuous Treatment Recommendation with Deep Survival Dose Response Function. (arXiv:2108.10453v2 [stat.ML] UPDATED)
    We propose a general formulation for continuous treatment recommendation problems in settings with clinical survival data, which we call the Deep Survival Dose Response Function (DeepSDRF). That is, we consider the problem of learning the conditional average dose response (CADR) function solely from historical data in which unobserved factors (confounders) affect both observed treatment and time-to-event outcomes. The estimated treatment effect from DeepSDRF enables us to develop recommender algorithms with explanatory insights. We compared two recommender approaches based on random search and reinforcement learning and found similar performance in terms of patient outcome. We tested the DeepSDRF and the corresponding recommender on extensive simulation studies and two empirical databases: 1) the Clinical Practice Research Datalink (CPRD) and 2) the eICU Research Institute (eRI) database. To the best of our knowledge, this is the first time that confounders are taken into consideration for addressing the continuous treatment effect with observational data in a medical context.
    Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks. (arXiv:2106.02978v2 [stat.ML] UPDATED)
    Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack with sub-linear regret. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by experiments that our proposed algorithm improves the robustness against various kinds of popular attacks.
    Neural Ensemble Search for Uncertainty Estimation and Dataset Shift. (arXiv:2006.08573v3 [cs.LG] UPDATED)
    Ensembles of neural networks achieve superior performance compared to stand-alone networks in terms of accuracy, uncertainty calibration and robustness to dataset shift. \emph{Deep ensembles}, a state-of-the-art method for uncertainty estimation, only ensemble random initializations of a \emph{fixed} architecture. Instead, we propose two methods for automatically constructing ensembles with \emph{varying} architectures, which implicitly trade-off individual architectures' strengths against the ensemble's diversity and exploit architectural variation as a source of diversity. On a variety of classification tasks and modern architecture search spaces, we show that the resulting ensembles outperform deep ensembles not only in terms of accuracy but also uncertainty calibration and robustness to dataset shift. Our further analysis and ablation studies provide evidence of higher ensemble diversity due to architectural variation, resulting in ensembles that can outperform deep ensembles, even when having weaker average base learners. To foster reproducibility, our code is available: \url{https://github.com/automl/nes}
    How and When Random Feedback Works: A Case Study of Low-Rank Matrix Factorization. (arXiv:2111.08706v2 [cs.NE] UPDATED)
    The success of gradient descent in ML and especially for learning neural networks is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equi-weighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bio-plausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure -- low-rank matrix factorization. In this problem, given a matrix $Y_{n\times m}$, the goal is to find a low rank factorization $Z_{n \times r}W_{r \times m}$ that minimizes the error $\|ZW-Y\|_F$. Gradient descent solves this problem optimally. We show that FA converges to the optimal solution when $r\ge \mbox{rank}(Y)$. We also shed light on how FA works. It is observed empirically that the forward weight matrices and (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates convergence of FA*, a closely related variant of FA. We also show that FA can be far from optimal when $r < \mbox{rank}(Y)$. This is the first provable separation result between gradient descent and FA. Moreover, the representations found by gradient descent and FA can be almost orthogonal even when their error $\|ZW-Y\|_F$ is approximately equal. As a corollary, these results also hold for training two-layer linear neural networks when the training input is isotropic, and the output is a linear function of the input.
    Mixture-of-Variational-Experts for Continual Learning. (arXiv:2110.12667v3 [cs.LG] UPDATED)
    One weakness of machine learning algorithms is the poor ability of models to solve new problems without forgetting previously acquired knowledge. The Continual Learning (CL) paradigm has emerged as a protocol to systematically investigate settings where the model sequentially observes samples generated by a series of tasks. In this work, we take a task-agnostic view of continual learning and develop a hierarchical information-theoretic optimality principle that facilitates a trade-off between learning and forgetting. We discuss this principle from a Bayesian perspective and show its connections to previous approaches to CL. Based on this principle, we propose a neural network layer, called the Mixture-of-Variational-Experts layer, that alleviates forgetting by creating a set of information processing paths through the network which is governed by a gating policy. Due to the general formulation based on generic utility functions, we can apply this optimality principle to a large variety of learning problems, including supervised learning, reinforcement learning, and generative modeling. We demonstrate the competitive performance of our method in continual supervised learning and in continual reinforcement learning.
    Black-box Certification and Learning under Adversarial Perturbations. (arXiv:2006.16520v2 [stat.ML] UPDATED)
    We formally study the problem of classification under adversarial perturbations from a learner's perspective as well as a third-party who aims at certifying the robustness of a given black-box classifier. We analyze a PAC-type framework of semi-supervised learning and identify possibility and impossibility results for proper learning of VC-classes in this setting. We further introduce a new setting of black-box certification under limited query budget, and analyze this for various classes of predictors and perturbation. We also consider the viewpoint of a black-box adversary that aims at finding adversarial examples, showing that the existence of an adversary with polynomial query complexity can imply the existence of a sample efficient robust learner.
    A Cram\'er Distance perspective on Quantile Regression based Distributional Reinforcement Learning. (arXiv:2110.00535v2 [stat.ML] UPDATED)
    Distributional reinforcement learning (DRL) extends the value-based approach by approximating the full distribution over future returns instead of the mean only, providing a richer signal that leads to improved performances. Quantile Regression (QR) based methods like QR-DQN project arbitrary distributions into a parametric subset of staircase distributions by minimizing the 1-Wasserstein distance. However, due to biases in the gradients, the quantile regression loss is used instead for training, guaranteeing the same minimizer and enjoying unbiased gradients. Non-crossing constraints on the quantiles have been shown to improve the performance of QR-DQN for uncertainty-based exploration strategies. The contribution of this work is in the setting of fixed quantile levels and is twofold. First, we prove that the Cram\'er distance yields a projection that coincides with the 1-Wasserstein one and that, under non-crossing constraints, the squared Cram\'er and the quantile regression losses yield collinear gradients, shedding light on the connection between these important elements of DRL. Second, we propose a low complexity algorithm to compute the Cram\'er distance.
    Open-Set Support Vector Machines. (arXiv:1606.03802v11 [cs.LG] UPDATED)
    Often, when dealing with real-world recognition problems, we do not need, and often cannot have, knowledge of the entire set of possible classes that might appear during operational testing. In such cases, we need to think of robust classification methods able to deal with the "unknown" and properly reject samples belonging to classes never seen during training. Notwithstanding, existing classifiers to date were mostly developed for the closed-set scenario, i.e., the classification setup in which it is assumed that all test samples belong to one of the classes with which the classifier was trained. In the open-set scenario, however, a test sample can belong to none of the known classes and the classifier must properly reject it by classifying it as unknown. In this work, we extend upon the well-known Support Vector Machines (SVM) classifier and introduce the Open-Set Support Vector Machines (OSSVM), which is suitable for recognition in open-set setups. OSSVM balances the empirical risk and the risk of the unknown and ensures that the region of the feature space in which a test sample would be classified as known (one of the known classes) is always bounded, ensuring a finite risk of the unknown. In this work, we also highlight the properties of the SVM classifier related to the open-set scenario, and provide necessary and sufficient conditions for an RBF SVM to have bounded open-space risk.  ( 3 min )
    Unadjusted Langevin algorithm for sampling a mixture of weakly smooth potentials. (arXiv:2112.09311v2 [stat.CO] UPDATED)
    Discretization of continuous-time diffusion processes is a widely recognized method for sampling. However, it seems to be a considerable restriction when the potentials are often required to be smooth (gradient Lipschitz). This paper studies the problem of sampling through Euler discretization, where the potential function is assumed to be a mixture of weakly smooth distributions and satisfies weakly dissipative. We establish the convergence in Kullback-Leibler (KL) divergence with the number of iterations to reach $\epsilon$-neighborhood of a target distribution in only polynomial dependence on the dimension. We relax the degenerated convex at infinity conditions of \citet{erdogdu2020convergence} and prove convergence guarantees under Poincar\'{e} inequality or non-strongly convex outside the ball. In addition, we also provide convergence in $L_{\beta}$-Wasserstein metric for the smoothing potential.  ( 2 min )
    An adaptive neuro-fuzzy model for attitude estimation and 2 control a 3 DOF system. (arXiv:2004.09756v3 [eess.SY] UPDATED)
    In recent decades, one of the scientists' main concerns has been to improve the accuracy of satellite attitude, regardless of the expense. The obvious result is that a large number of control strategies have been used to address this problem. In this study, an adaptive neuro-fuzzy integrated (ANFIS) satellite attitude estimation and control system was developed. The controller is trained with the data provided by an optimal controller. A pulse modulator is used to generate the right ON/OFF commands of the thruster actuator. To evaluate the performance of the AN-FIS controller in closed-loop simulation, an ANFIS observer is used to estimate the attitude and angular velocities of the satellite using magnetometer, sun sensor and data gyro data. In addition, a new ANFIS system will be proposed and evaluated that can jointly control and estimate the system. The performance of the ANFIS controller is compared to the optimal PID controller in a Monte Carlo simulation with different initial conditions, disturbance and noise. The results show that the ANFIS controller can surpass the optimal PID controller in several aspects, including time and smoothness. In addition, the ANFIS estimator is examined and the results demonstrate the high ability of this designated observers. Both the control and estimation phases are simulated by a single ANFIS subsystem, taking into account the high capacity of ANFIS, and the results of using the ANFIS model are demonstrated.  ( 2 min )
    FLEA: Provably Fair Multisource Learning from Unreliable Training Data. (arXiv:2106.11732v3 [cs.LG] UPDATED)
    Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that allows the learning system to identify and suppress those data sources that would have a negative impact on fairness or accuracy if they were used for training. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally, we prove formally that - given enough data - FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half.  ( 2 min )
    Efficient and Differentiable Conformal Prediction with General Function Classes. (arXiv:2202.11091v1 [cs.LG])
    Quantifying the data uncertainty in learning tasks is often done by learning a prediction interval or prediction set of the label given the input. Two commonly desired properties for learned prediction sets are \emph{valid coverage} and \emph{good efficiency} (such as low length or low cardinality). Conformal prediction is a powerful technique for learning prediction sets with valid coverage, yet by default its conformalization step only learns a single parameter, and does not optimize the efficiency over more expressive function classes. In this paper, we propose a generalization of conformal prediction to multiple learnable parameters, by considering the constrained empirical risk minimization (ERM) problem of finding the most efficient prediction set subject to valid empirical coverage. This meta-algorithm generalizes existing conformal prediction algorithms, and we show that it achieves approximate valid population coverage and near-optimal efficiency within class, whenever the function class in the conformalization step is low-capacity in a certain sense. Next, this ERM problem is challenging to optimize as it involves a non-differentiable coverage constraint. We develop a gradient-based algorithm for it by approximating the original constrained ERM using differentiable surrogate losses and Lagrangians. Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly over existing approaches in several applications such as prediction intervals with improved length, minimum-volume prediction sets for multi-output regression, and label prediction sets for image classification.  ( 2 min )
    Differentially Private Estimation of Heterogeneous Causal Effects. (arXiv:2202.11043v1 [stat.ML])
    Estimating heterogeneous treatment effects in domains such as healthcare or social science often involves sensitive data where protecting privacy is important. We introduce a general meta-algorithm for estimating conditional average treatment effects (CATE) with differential privacy (DP) guarantees. Our meta-algorithm can work with simple, single-stage CATE estimators such as S-learner and more complex multi-stage estimators such as DR and R-learner. We perform a tight privacy analysis by taking advantage of sample splitting in our meta-algorithm and the parallel composition property of differential privacy. In this paper, we implement our approach using DP-EBMs as the base learner. DP-EBMs are interpretable, high-accuracy models with privacy guarantees, which allow us to directly observe the impact of DP noise on the learned causal model. Our experiments show that multi-stage CATE estimators incur larger accuracy loss than single-stage CATE or ATE estimators and that most of the accuracy loss from differential privacy is due to an increase in variance, not biased estimates of treatment effects.  ( 2 min )
    Practical Schemes for Finding Near-Stationary Points of Convex Finite-Sums. (arXiv:2105.12062v2 [math.OC] UPDATED)
    In convex optimization, the problem of finding near-stationary points has not been adequately studied yet, unlike other optimality measures such as the function value. Even in the deterministic case, the optimal method (OGM-G, due to Kim and Fessler (2021)) has just been discovered recently. In this work, we conduct a systematic study of algorithmic techniques for finding near-stationary points of convex finite-sums. Our main contributions are several algorithmic discoveries: (1) we discover a memory-saving variant of OGM-G based on the performance estimation problem approach (Drori and Teboulle, 2014); (2) we design a new accelerated SVRG variant that can simultaneously achieve fast rates for minimizing both the gradient norm and function value; (3) we propose an adaptively regularized accelerated SVRG variant, which does not require the knowledge of some unknown initial constants and achieves near-optimal complexities. We put an emphasis on the simplicity and practicality of the new schemes, which could facilitate future work.  ( 2 min )
    Near-Optimal Performance Bounds for Orthogonal and Permutation Group Synchronization via Spectral Methods. (arXiv:2008.05341v3 [cs.IT] UPDATED)
    Group synchronization asks to recover group elements from their pairwise measurements. It has found numerous applications across various scientific disciplines. In this work, we focus on orthogonal and permutation group synchronization which are widely used in computer vision such as object matching and structure from motion. Among many available approaches, the spectral methods have enjoyed great popularity due to their efficiency and convenience. We will study the performance guarantees of the spectral methods in solving these two synchronization problems by investigating how well the computed eigenvectors approximate each group element individually. We establish our theory by applying the recent popular~\emph{leave-one-out} technique and derive a~\emph{block-wise} performance bound for the recovery of each group element via eigenvectors. In particular, for orthogonal group synchronization, we obtain a near-optimal performance bound for the group recovery in presence of additive Gaussian noise. For permutation group synchronization under random corruption, we show that the widely-used two-step procedure (spectral method plus rounding) can recover all the group elements exactly if the SNR (signal-to-noise ratio) is close to the information theoretical limit. Our numerical experiments confirm our theory and indicate a sharp phase transition for the exact group recovery.  ( 2 min )
    Robust and Provable Guarantees for Sparse Random Embeddings. (arXiv:2202.10815v1 [cs.LG])
    In this work, we improve upon the guarantees for sparse random embeddings, as they were recently provided and analyzed by Freksen at al. (NIPS'18) and Jagadeesan (NIPS'19). Specifically, we show that (a) our bounds are explicit as opposed to the asymptotic guarantees provided previously, and (b) our bounds are guaranteed to be sharper by practically significant constants across a wide range of parameters, including the dimensionality, sparsity and dispersion of the data. Moreover, we empirically demonstrate that our bounds significantly outperform prior works on a wide range of real-world datasets, such as collections of images, text documents represented as bags-of-words, and text sequences vectorized by neural embeddings. Behind our numerical improvements are techniques of broader interest, which improve upon key steps of previous analyses in terms of (c) tighter estimates for certain types of quadratic chaos, (d) establishing extreme properties of sparse linear forms, and (e) improvements on bounds for the estimation of sums of independent random variables.  ( 2 min )
    Inequality Constrained Stochastic Nonlinear Optimization via Active-Set Sequential Quadratic Programming. (arXiv:2109.11502v2 [math.OC] UPDATED)
    We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming algorithm that uses a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian, and performs stochastic line search to decide the stepsize. The global convergence is established: for any initialization, the "liminf" of the KKT residuals converges to zero almost surely. Our algorithm and analysis further develop the work of Na et al. (2021) by allowing nonlinear inequality constraints without requiring the strict complementary condition. We demonstrate the performance of the algorithm on a subset of nonlinear problems collected in CUTEst test set.  ( 2 min )
    Pseudo Numerical Methods for Diffusion Models on Manifolds. (arXiv:2202.09778v1 [cs.CV] CROSS LISTED)
    Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at a high speedup rate, which limit their practicability. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules. Our implementation is available at https://github.com/luping-liu/PNDM.
    PyTorch Geometric Signed Directed: A Survey and Software on Graph Neural Networks for Signed and Directed Graphs. (arXiv:2202.10793v1 [cs.LG])
    Signed networks are ubiquitous in many real-world applications (e.g., social networks encoding trust/distrust relationships, correlation networks arising from time series data). While many signed networks are directed, there is a lack of survey papers and software packages on graph neural networks (GNNs) specially designed for directed networks. In this paper, we present PyTorch Geometric Signed Directed, a survey and software on GNNs for signed and directed networks. We review typical tasks, loss functions and evaluation metrics in the analysis of signed and directed networks, discuss data used in related experiments, and provide an overview of methods proposed. The deep learning framework consists of easy-to-use GNN models, synthetic and real-world data, as well as task-specific evaluation metrics and loss functions for signed and directed networks. The software is presented in a modular fashion, so that signed and directed networks can also be treated separately. As an extension library for PyTorch Geometric, our proposed software is maintained with open-source releases, detailed documentation, continuous integration, unit tests and code coverage checks. Our code is publicly available at \url{https://github.com/SherylHYX/pytorch_geometric_signed_directed}.  ( 2 min )
    MSTGD:A Memory Stochastic sTratified Gradient Descent Method with an Exponential Convergence Rate. (arXiv:2202.10923v1 [stat.ML])
    The fluctuation effect of gradient expectation and variance caused by parameter update between consecutive iterations is neglected or confusing by current mainstream gradient optimization algorithms.Using this fluctuation effect, combined with the stratified sampling strategy, this paper designs a novel \underline{M}emory \underline{S}tochastic s\underline{T}ratified Gradient Descend(\underline{MST}GD) algorithm with an exponential convergence rate. Specifically, MSTGD uses two strategies for variance reduction: the first strategy is to perform variance reduction according to the proportion p of used historical gradient, which is estimated from the mean and variance of sample gradients before and after iteration, and the other strategy is stratified sampling by category. The statistic \ $\bar{G}_{mst}$\ designed under these two strategies can be adaptively unbiased, and its variance decays at a geometric rate. This enables MSTGD based on $\bar{G}_{mst}$ to obtain an exponential convergence rate of the form $\lambda^{2(k-k_0)}$($\lambda\in (0,1)$,k is the number of iteration steps,$\lambda$ is a variable related to proportion p).Unlike most other algorithms that claim to achieve an exponential convergence rate, the convergence rate is independent of parameters such as dataset size N, batch size n, etc., and can be achieved at a constant step size.Theoretical and experimental results show the effectiveness of MSTGD  ( 2 min )
    On Uncertainty Estimation by Tree-based Surrogate Models in Sequential Model-based Optimization. (arXiv:2202.10669v1 [stat.ML])
    Sequential model-based optimization sequentially selects a candidate point by constructing a surrogate model with the history of evaluations, to solve a black-box optimization problem. Gaussian process (GP) regression is a popular choice as a surrogate model, because of its capability of calculating prediction uncertainty analytically. On the other hand, an ensemble of randomized trees is another option and has practical merits over GPs due to its scalability and easiness of handling continuous/discrete mixed variables. In this paper we revisit various ensembles of randomized trees to investigate their behavior in the perspective of prediction uncertainty estimation. Then, we propose a new way of constructing an ensemble of randomized trees, referred to as BwO forest, where bagging with oversampling is employed to construct bootstrapped samples that are used to build randomized trees with random splitting. Experimental results demonstrate the validity and good performance of BwO forest over existing tree-based models in various circumstances.  ( 2 min )
    Connecting Optimization and Generalization via Gradient Flow Path Length. (arXiv:2202.10670v1 [stat.ML])
    Optimization and generalization are two essential aspects of machine learning. In this paper, we propose a framework to connect optimization with generalization by analyzing the generalization error based on the length of optimization trajectory under the gradient flow algorithm after convergence. Through our approach, we show that, with a proper initialization, gradient flow converges following a short path with an explicit length estimate. Such an estimate induces a length-based generalization bound, showing that short optimization paths after convergence are associated with good generalization, which also matches our numerical results. Our framework can be applied to broad settings. For example, we use it to obtain generalization estimates on three distinct machine learning models: underdetermined $\ell_p$ linear regression, kernel regression, and overparameterized two-layer ReLU neural networks.  ( 2 min )
    Counterfactual Phenotyping with Censored Time-to-Events. (arXiv:2202.11089v1 [cs.LG])
    Estimation of treatment efficacy of real-world clinical interventions involves working with continuous outcomes such as time-to-death, re-hospitalization, or a composite event that may be subject to censoring. Causal reasoning in such scenarios requires decoupling the effects of confounding physiological characteristics that affect baseline survival rates from the effects of the interventions being assessed. In this paper, we present a latent variable approach to model heterogeneous treatment effects by proposing that an individual can belong to one of latent clusters with distinct response characteristics. We show that this latent structure can mediate the base survival rates and helps determine the effects of an intervention. We demonstrate the ability of our approach to discover actionable phenotypes of individuals based on their treatment response on multiple large randomized clinical trials originally conducted to assess appropriate treatments to reduce cardiovascular risk.  ( 2 min )
    Stochastic Causal Programming for Bounding Treatment Effects. (arXiv:2202.10806v1 [stat.ML])
    Causal effect estimation is important for numerous tasks in the natural and social sciences. However, identifying effects is impossible from observational data without making strong, often untestable assumptions. We consider algorithms for the partial identification problem, bounding treatment effects from multivariate, continuous treatments over multiple possible causal models when unmeasured confounding makes identification impossible. We consider a framework where observable evidence is matched to the implications of constraints encoded in a causal model by norm-based criteria. This generalizes classical approaches based purely on generative models. Casting causal effects as objective functions in a constrained optimization problem, we combine flexible learning algorithms with Monte Carlo methods to implement a family of solutions under the name of stochastic causal programming. In particular, we present ways by which such constrained optimization problems can be parameterized without likelihood functions for the causal or the observed data model, reducing the computational and statistical complexity of the task.  ( 2 min )
    Random Graph Matching in Geometric Models: the Case of Complete Graphs. (arXiv:2202.10662v1 [math.ST])
    This paper studies the problem of matching two complete graphs with edge weights correlated through latent geometries, extending a recent line of research on random graph matching with independent edge weights to geometric models. Specifically, given a random permutation $\pi^*$ on $[n]$ and $n$ iid pairs of correlated Gaussian vectors $\{X_{\pi^*(i)}, Y_i\}$ in $\mathbb{R}^d$ with noise parameter $\sigma$, the edge weights are given by $A_{ij}=\kappa(X_i,X_j)$ and $B_{ij}=\kappa(Y_i,Y_j)$ for some link function $\kappa$. The goal is to recover the hidden vertex correspondence $\pi^*$ based on the observation of $A$ and $B$. We focus on the dot-product model with $\kappa(x,y)=\langle x, y \rangle$ and Euclidean distance model with $\kappa(x,y)=\|x-y\|^2$, in the low-dimensional regime of $d=o(\log n)$ wherein the underlying geometric structures are most evident. We derive an approximate maximum likelihood estimator, which provably achieves, with high probability, perfect recovery of $\pi^*$ when $\sigma=o(n^{-2/d})$ and almost perfect recovery with a vanishing fraction of errors when $\sigma=o(n^{-1/d})$. Furthermore, these conditions are shown to be information-theoretically optimal even when the latent coordinates $\{X_i\}$ and $\{Y_i\}$ are observed, complementing the recent results of [DCK19] and [KNW22] in geometric models of the planted bipartite matching problem. As a side discovery, we show that the celebrated spectral algorithm of [Ume88] emerges as a further approximation to the maximum likelihood in the geometric model.  ( 2 min )
    Convex Loss Functions for Contextual Pricing with Observational Posted-Price Data. (arXiv:2202.10944v1 [cs.LG])
    We study an off-policy contextual pricing problem where the seller has access to samples of prices which customers were previously offered, whether they purchased at that price, and auxiliary features describing the customer and/or item being sold. This is in contrast to the well-studied setting in which samples of the customer's valuation (willingness to pay) are observed. In our setting, the observed data is influenced by the historic pricing policy, and we do not know how customers would have responded to alternative prices. We introduce suitable loss functions for this pricing setting which can be directly optimized to find an effective pricing policy with expected revenue guarantees without the need for estimation of an intermediate demand function. We focus on convex loss functions. This is particularly relevant when linear pricing policies are desired for interpretability reasons, resulting in a tractable convex revenue optimization problem. We further propose generalized hinge and quantile pricing loss functions, which price at a multiplicative factor of the conditional expected value or a particular quantile of the valuation distribution when optimized, despite the valuation data not being observed. We prove expected revenue bounds for these pricing policies respectively when the valuation distribution is log-concave, and provide generalization bounds for the finite sample case. Finally, we conduct simulations on both synthetic and real-world data to demonstrate that this approach is competitive with, and in some settings outperforms, state-of-the-art methods in contextual pricing.  ( 2 min )
    Policy Evaluation for Temporal and/or Spatial Dependent Experiments in Ride-sourcing Platforms. (arXiv:2202.10887v1 [stat.ME])
    Policy evaluation based on A/B testing has attracted considerable interest in digital marketing, but such evaluation in ride-sourcing platforms (e.g., Uber and Didi) is not well studied primarily due to the complex structure of their temporal and/or spatial dependent experiments. Motivated by policy evaluation in ride-sourcing platforms, the aim of this paper is to establish causal relationship between platform's policies and outcomes of interest under a switchback design. We propose a novel potential outcome framework based on a temporal varying coefficient decision process (VCDP) model to capture the dynamic treatment effects in temporal dependent experiments. We further characterize the average treatment effect by decomposing it as the sum of direct effect (DE) and indirect effect (IE). We develop estimation and inference procedures for both DE and IE. Furthermore, we propose a spatio-temporal VCDP to deal with spatiotemporal dependent experiments. For both VCDP models, we establish the statistical properties (e.g., weak convergence and asymptotic power) of our estimation and inference procedures. We conduct extensive simulations to investigate the finite-sample performance of the proposed estimation and inference procedures. We examine how our VCDP models can help improve policy evaluation for various dispatching and dispositioning policies in Didi.  ( 2 min )
    Sobolev Transport: A Scalable Metric for Probability Measures with Graph Metrics. (arXiv:2202.10723v1 [cs.LG])
    Optimal transport (OT) is a popular measure to compare probability distributions. However, OT suffers a few drawbacks such as (i) a high complexity for computation, (ii) indefiniteness which limits its applicability to kernel machines. In this work, we consider probability measures supported on a graph metric space and propose a novel Sobolev transport metric. We show that the Sobolev transport metric yields a closed-form formula for fast computation and it is negative definite. We show that the space of probability measures endowed with this transport distance is isometric to a bounded convex set in a Euclidean space with a weighted $\ell_p$ distance. We further exploit the negative definiteness of the Sobolev transport to design positive-definite kernels, and evaluate their performances against other baselines in document classification with word embeddings and in topological data analysis.  ( 2 min )
    Explicit Regularization via Regularizer Mirror Descent. (arXiv:2202.10788v1 [cs.LG])
    Despite perfectly interpolating the training data, deep neural networks (DNNs) can often generalize fairly well, in part due to the "implicit regularization" induced by the learning algorithm. Nonetheless, various forms of regularization, such as "explicit regularization" (via weight decay), are often used to avoid overfitting, especially when the data is corrupted. There are several challenges with explicit regularization, most notably unclear convergence properties. Inspired by convergence properties of stochastic mirror descent (SMD) algorithms, we propose a new method for training DNNs with regularization, called regularizer mirror descent (RMD). In highly overparameterized DNNs, SMD simultaneously interpolates the training data and minimizes a certain potential function of the weights. RMD starts with a standard cost which is the sum of the training loss and a convex regularizer of the weights. Reinterpreting this cost as the potential of an "augmented" overparameterized network and applying SMD yields RMD. As a result, RMD inherits the properties of SMD and provably converges to a point "close" to the minimizer of this cost. RMD is computationally comparable to stochastic gradient descent (SGD) and weight decay, and is parallelizable in the same manner. Our experimental results on training sets with various levels of corruption suggest that the generalization performance of RMD is remarkably robust and significantly better than both SGD and weight decay, which implicitly and explicitly regularize the $\ell_2$ norm of the weights. RMD can also be used to regularize the weights to a desired weight vector, which is particularly relevant for continual learning.  ( 2 min )
    Learning Infomax and Domain-Independent Representations for Causal Effect Inference with Real-World Data. (arXiv:2202.10885v1 [stat.ML])
    The foremost challenge to causal inference with real-world data is to handle the imbalance in the covariates with respect to different treatment options, caused by treatment selection bias. To address this issue, recent literature has explored domain-invariant representation learning based on different domain divergence metrics (e.g., Wasserstein distance, maximum mean discrepancy, position-dependent metric, and domain overlap). In this paper, we reveal the weaknesses of these strategies, i.e., they lead to the loss of predictive information when enforcing the domain invariance; and the treatment effect estimation performance is unstable, which heavily relies on the characteristics of the domain distributions and the choice of domain divergence metrics. Motivated by information theory, we propose to learn the Infomax and Domain-Independent Representations to solve the above puzzles. Our method utilizes the mutual information between the global feature representations and individual feature representations, and the mutual information between feature representations and treatment assignment predictions, in order to maximally capture the common predictive information for both treatment and control groups. Moreover, our method filters out the influence of instrumental and irrelevant variables, and thus it effectively increases the predictive ability of potential outcomes. Experimental results on both the synthetic and real-world datasets show that our method achieves state-of-the-art performance on causal effect inference. Moreover, our method exhibits reliable prediction performances when facing data with different characteristics of data distributions, complicated variable types, and severe covariate imbalance.  ( 2 min )
    Distributed Sparse Multicategory Discriminant Analysis. (arXiv:2202.10913v1 [math.ST])
    This paper proposes a convex formulation for sparse multicategory linear discriminant analysis and then extend it to the distributed setting when data are stored across multiple sites. The key observation is that for the purpose of classification it suffices to recover the discriminant subspace which is invariant to orthogonal transformations. Theoretically, we establish statistical properties ensuring that the distributed sparse multicategory linear discriminant analysis performs as good as the centralized version after {a few rounds} of communications. Numerical studies lend strong support to our methodology and theory.  ( 2 min )
    CD-ROM: Complementary Deep-Reduced Order Model. (arXiv:2202.10746v1 [physics.flu-dyn])
    Model order reduction through the POD-Galerkin method can lead to dramatic gains in terms of computational efficiency in solving physical problems. However, the applicability of the method to non linear high-dimensional dynamical systems such as the Navier-Stokes equations has been shown to be limited, producing inaccurate and sometimes unstable models. This paper proposes a closure modeling approach for classical POD-Galerkin reduced order models (ROM). We use multi layer perceptrons (MLP) to learn a continuous in time closure model through the recently proposed Neural ODE method. Inspired by Taken's theorem as well as the Mori-Zwanzig formalism, we augment ROMs with a delay differential equation architecture to model non-Markovian effects in reduced models. The proposed model, called CD-ROM (Complementary Deep-Reduced Order Model) is able to retain information from past states of the system and use it to correct the imperfect reduced dynamics. The model can be integrated in time as a system of ordinary differential equations using any classical time marching scheme. We demonstrate the ability of our CD-ROM approach to improve the accuracy of POD-Galerkin models on two CFD examples, even in configurations unseen during training.  ( 2 min )
    Batched Dueling Bandits. (arXiv:2202.10660v1 [cs.LG])
    The $K$-armed dueling bandit problem, where the feedback is in the form of noisy pairwise comparisons, has been widely studied. Previous works have only focused on the sequential setting where the policy adapts after every comparison. However, in many applications such as search ranking and recommendation systems, it is preferable to perform comparisons in a limited number of parallel batches. We study the batched $K$-armed dueling bandit problem under two standard settings: (i) existence of a Condorcet winner, and (ii) strong stochastic transitivity and stochastic triangle inequality. For both settings, we obtain algorithms with a smooth trade-off between the number of batches and regret. Our regret bounds match the best known sequential regret bounds (up to poly-logarithmic factors), using only a logarithmic number of batches. We complement our regret analysis with a nearly-matching lower bound. Finally, we also validate our theoretical results via experiments on synthetic and real data.  ( 2 min )
    Confident Neural Network Regression with Bootstrapped Deep Ensembles. (arXiv:2202.10903v1 [stat.ML])
    With the rise of the popularity and usage of neural networks, trustworthy uncertainty estimation is becoming increasingly essential. In this paper we present a computationally cheap extension of Deep Ensembles for a regression setting called Bootstrapped Deep Ensembles that explicitly takes the effect of finite data into account using a modified version of the parametric bootstrap. We demonstrate through a simulation study that our method has comparable or better prediction intervals and superior confidence intervals compared to Deep Ensembles and other state-of-the-art methods. As an added bonus, our method is better capable of detecting overfitting than standard Deep Ensembles.  ( 2 min )
    Message passing all the way up. (arXiv:2202.11097v1 [cs.LG])
    The message passing framework is the foundation of the immense success enjoyed by graph neural networks (GNNs) in recent years. In spite of its elegance, there exist many problems it provably cannot solve over given input graphs. This has led to a surge of research on going "beyond message passing", building GNNs which do not suffer from those limitations -- a term which has become ubiquitous in regular discourse. However, have those methods truly moved beyond message passing? In this position paper, I argue about the dangers of using this term -- especially when teaching graph representation learning to newcomers. I show that any function of interest we want to compute over graphs can, in all likelihood, be expressed using pairwise message passing -- just over a potentially modified graph, and argue how most practical implementations subtly do this kind of trick anyway. Hoping to initiate a productive discussion, I propose replacing "beyond message passing" with a more tame term, "augmented message passing".  ( 2 min )
    Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process. (arXiv:2202.10589v1 [stat.ML])
    This paper is concerned with constructing a confidence interval for a target policy's value offline based on a pre-collected observational data in infinite horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results, simulated and real datasets obtained from ridesharing companies.  ( 2 min )
    Myriad: a real-world testbed to bridge trajectory optimization and deep learning. (arXiv:2202.10600v1 [cs.LG])
    We present Myriad, a testbed written in JAX for learning and planning in real-world continuous environments. The primary contributions of Myriad are threefold. First, Myriad provides machine learning practitioners access to trajectory optimization techniques for application within a typical automatic differentiation workflow. Second, Myriad presents many real-world optimal control problems, ranging from biology to medicine to engineering, for use by the machine learning community. Formulated in continuous space and time, these environments retain some of the complexity of real-world systems often abstracted away by standard benchmarks. As such, Myriad strives to serve as a stepping stone towards application of modern machine learning techniques for impactful real-world tasks. Finally, we use the Myriad repository to showcase a novel approach for learning and control tasks. Trained in a fully end-to-end fashion, our model leverages an implicit planning module over neural ordinary differential equations, enabling simultaneous learning and planning with complex environment dynamics.  ( 2 min )
    Gaussian Processes and Statistical Decision-making in Non-Euclidean Spaces. (arXiv:2202.10613v1 [stat.ML])
    Bayesian learning using Gaussian processes provides a foundational framework for making decisions in a manner that balances what is known with what could be learned by gathering data. In this dissertation, we develop techniques for broadening the applicability of Gaussian processes. This is done in two ways. Firstly, we develop pathwise conditioning techniques for Gaussian processes, which allow one to express posterior random functions as prior random functions plus a dependent update term. We introduce a wide class of efficient approximations built from this viewpoint, which can be randomly sampled once in advance, and evaluated at arbitrary locations without any subsequent stochasticity. This key property improves efficiency and makes it simpler to deploy Gaussian process models in decision-making settings. Secondly, we develop a collection of Gaussian process models over non-Euclidean spaces, including Riemannian manifolds and graphs. We derive fully constructive expressions for the covariance kernels of scalar-valued Gaussian processes on Riemannian manifolds and graphs. Building on these ideas, we describe a formalism for defining vector-valued Gaussian processes on Riemannian manifolds. The introduced techniques allow all of these models to be trained using standard computational methods. In total, these contributions make Gaussian processes easier to work with and allow them to be used within a wider class of domains in an effective and principled manner. This, in turn, makes it possible to potentially apply Gaussian processes to novel decision-making settings.  ( 2 min )
    Accelerating Primal-dual Methods for Regularized Markov Decision Processes. (arXiv:2202.10506v1 [math.OC])
    Entropy regularized Markov decision processes have been widely used in reinforcement learning. This paper is concerned with the primal-dual formulation of the entropy regularized problems. Standard first-order methods suffer from slow convergence due to the lack of strict convexity and concavity. To address this issue, we first introduce a new quadratically convexified primal-dual formulation. The natural gradient ascent descent of the new formulation enjoys global convergence guarantee and exponential convergence rate. We also propose a new interpolating metric that further accelerates the convergence significantly. Numerical results are provided to demonstrate the performance of the proposed methods under multiple settings.  ( 2 min )
    A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets. (arXiv:2202.10574v1 [stat.ML])
    The two-sided markets such as ride-sharing companies often involve a group of subjects who are making sequential decisions across time and/or location. With the rapid development of smart phones and internet of things, they have substantially transformed the transportation landscape of human beings. In this paper we consider large-scale fleet management in ride-sharing companies that involve multiple units in different areas receiving sequences of products (or treatments) over time. Major technical challenges, such as policy evaluation, arise in those studies because (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. To address both challenges simultaneously, we introduce a multi-agent reinforcement learning (MARL) framework for carrying policy evaluation in these studies. We propose novel estimators for mean outcomes under different products that are consistent despite the high-dimensionality of state-action space. The proposed estimator works favorably in simulation experiments. We further illustrate our method using a real dataset obtained from a two-sided marketplace company to evaluate the effects of applying different subsidizing policies. A Python implementation of the proposed method is available at https://github.com/RunzheStat/CausalMARL.  ( 2 min )
    A Globally Convergent Evolutionary Strategy for Stochastic Constrained Optimization with Applications to Reinforcement Learning. (arXiv:2202.10464v1 [cs.NE])
    Evolutionary strategies have recently been shown to achieve competing levels of performance for complex optimization problems in reinforcement learning. In such problems, one often needs to optimize an objective function subject to a set of constraints, including for instance constraints on the entropy of a policy or to restrict the possible set of actions or states accessible to an agent. Convergence guarantees for evolutionary strategies to optimize stochastic constrained problems are however lacking in the literature. In this work, we address this problem by designing a novel optimization algorithm with a sufficient decrease mechanism that ensures convergence and that is based only on estimates of the functions. We demonstrate the applicability of this algorithm on two types of experiments: i) a control task for maximizing rewards and ii) maximizing rewards subject to a non-relaxable set of constraints.  ( 2 min )
    Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations. (arXiv:2202.10638v1 [stat.ML])
    Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the correct data augmentation is chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation. Our approach relies on phrasing data augmentation as an invariance in the prior distribution and learning it using Bayesian model selection, which has been shown to work in Gaussian processes, but not yet for deep neural networks. We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation on image datasets.  ( 2 min )
    Order-Optimal Error Bounds for Noisy Kernel-Based Bayesian Quadrature. (arXiv:2202.10615v1 [stat.ML])
    In this paper, we study the sample complexity of {\em noisy Bayesian quadrature} (BQ), in which we seek to approximate an integral based on noisy black-box queries to the underlying function. We consider functions in a {\em Reproducing Kernel Hilbert Space} (RKHS) with the Mat\'ern-$\nu$ kernel, focusing on combinations of the parameter $\nu$ and dimension $d$ such that the RKHS is equivalent to a Sobolev class. In this setting, we provide near-matching upper and lower bounds on the best possible average error. Specifically, we find that when the black-box queries are subject to Gaussian noise having variance $\sigma^2$, any algorithm making at most $T$ queries (even with adaptive sampling) must incur a mean absolute error of $\Omega(T^{-\frac{\nu}{d}-1} + \sigma T^{-\frac{1}{2}})$, and there exists a non-adaptive algorithm attaining an error of at most $O(T^{-\frac{\nu}{d}-1} + \sigma T^{-\frac{1}{2}})$. Hence, the bounds are order-optimal, and establish that there is no adaptivity gap in terms of scaling laws.  ( 2 min )

  • Open

    How Kustomer utilizes custom Docker images & Amazon SageMaker to build a text classification pipeline
    This is a guest post by Kustomer’s Senior Software & Machine Learning Engineer, Ian Lantzy, and AWS team Umesh Kalaspurkar, Prasad Shetty, and Jonathan Greifenberger. In Kustomer’s own words, “Kustomer is the omnichannel SaaS CRM platform reimagining enterprise customer service to deliver standout experiences. Built with intelligent automation, we scale to meet the needs of […]  ( 8 min )
    Build, train, and deploy Amazon Lookout for Equipment models using the Python Toolbox
    Predictive maintenance can be an effective way to prevent industrial machinery failures and expensive downtime by proactively monitoring the condition of your equipment, so you can be alerted to any anomalies before equipment failures occur. Installing sensors and the necessary infrastructure for data connectivity, storage, analytics, and alerting are the foundational elements for enabling predictive […]  ( 9 min )
    Choose the best data source for your Amazon SageMaker training job
    Amazon SageMaker is a managed service that makes it easy to build, train, and deploy machine learning (ML) models. Data scientists use SageMaker training jobs to easily train ML models; you don’t have to worry about managing compute resources, and you pay only for the actual training time. Data ingestion is an integral part of […]  ( 17 min )
    How InpharmD uses Amazon Kendra and Amazon Lex to drive evidence-based patient care
    The intersection of DI and AI: Drug information refers to the discovery, use, and management of healthcare and medical information. Healthcare providers have many challenges associated with drug information discovery, such as intensive time involvement, lack of accessibility, and accuracy of reliable data. The average clinical query requires a literature search that takes an average of 18.5 hours. In addition, drug information often lies in disparate information silos, behind pay walls and design walls, and quickly becomes stale.  ( 5 min )
    Control formality in machine translated text using Amazon Translate
    Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. Amazon Translate now supports formality customization. This feature allows you to customize the level of formality in your translation output. At the time of writing, the formality customization feature is available for six target languages: French, German, Hindi, Italian, […]  ( 5 min )
  • Open

    [P] Lambda's Machine Learning Infrastructure Playbook and Best Practices
    I originally put this together for a conference last year and finally got a chance to upload it to youtube. I figured /r/machinelearning might like it --- here's the full document & video presentation associated with it: video presentation: https://www.youtube.com/watch?v=3EnIW0EZkr4 slide pdf: https://files.lambdalabs.com/lambda-machine-learning-infrastructure-playbook.pdf It covers both the business side and technical side of building infrastructure for your team. In the presentation, we summarize a customer survey where 80% of respondents reporting using on-prem infrastructure. That seems pretty high to me and I was wondering if you think that's representative of the community or if our sampling is biased? submitted by /u/sabalaba [link] [comments]  ( 1 min )
    [Discussion] What would be your examples of the different agent architectures?
    Hi everyone, new to the world of AI. I was doing some research and was wondering about all the different examples of agent architectures and which types of examples we could group into them. For example, I'd say that self-driving cars like the Tesla is an example of a goal based agent, although it could be a reinforcement learning agent, but I don't think so since it can't exactly be given a reward, but I could be wrong. Another example could be smart robots - which I think is a reinforcement learning agent. There's 5 different agent types: reflex, model, goal, utility and learning agents. I guess what I'm curious about is what would be your examples for each of the 5? I guess also since we humans are also intelligent agents, we're both goal based and reinforcement learning agents. What do you all think? 🤔🙂 submitted by /u/gonewiththestorm [link] [comments]  ( 1 min )
    [D] EU Artificial Intelligence Act: Risk Levels
    The EU proposes the first regulatory framework for AI that addresses the risks of AI and enables Europe to play a leading role globally. Four risk levels are defined Unacceptable Risk (Prohibited) High-Risk (Conformity Assessment) Limited Risk (Transparency obligation) Minimal Risk (No obligation) More details in the following article (Medium): EU Artificial Intelligence Act: Risk Levels submitted by /u/Philo167 [link] [comments]  ( 1 min )
    [D] Exploiting Prior Knowledge of Class Distribution in Semi-supervised Learning?
    Suppose we are training a classifier and have two datasets to work with: a labeled dataset that is not necessarily a representative sample of the population on which the classifier will be deployed an unlabeled dataset that is representative of the population the classifier will be used on We also have prior knowledge of the class distribution in the general population, and therefore in the unlabeled dataset as well. Does anyone know of any research into using this sort of prior knowledge to improve classification performance over some baseline semi-supervised learning method? For example, methods like consistency regularization favor decision boundaries that pass through regions with low density of unlabeled examples. If we know that the unlabeled set is e.g. 60% positive and 40% negative, is there any way to also encourage decision boundaries that split the unlabeled examples according to the known ratio? Or in the case where one class strongly dominates in the unlabeled data, e.g. 5% positive and 95% negative, can we somehow make use of the fact that unlabeled examples are probably a certain class? ​ Edit: for context, the reason I'm interested in the semi-supervised (vs pure supervised) case is because when all your data is labeled, the class distribution is not really "additional" information-- it's something that a model could directly infer from the training labels. With unlabeled data, the class distribution is additional information that cannot be directly inferred from the raw data. Intuitively, I feel like it should be possible to use this additional information to train a better model, but my literature search isn't turning up any results. submitted by /u/Haycart [link] [comments]  ( 2 min )
    [D] Finding dates for upcoming conferences
    Hey, So how do you guys go about finding the dates of upcoming conferences? Is there a website where you can find a list of upcoming conferences and their tentative deadlines? I did look into WikiCFP, but I think conferences there are added only after the call for papers is announced. Im trying to figure out tentative dates, so I can have a good timeline/target for myself. Thanks submitted by /u/Bibbidi_Babbidi_Boo [link] [comments]  ( 1 min )
    [D] Neural Radiance Fields in Real-World Applications
    Hey everybody, Last year there have been a couple of papers on how to make Neural radiance fields (NeRFs) more efficient so that they can be used for real-time applications. So I was wondering whether there are some public projects or papers that actually use NeRF-like approaches in real-world applications, for example in game-engines or VR/AR (not technical demos). Or perhaps somebody here in the ML community has experience with this? I could not find much regarding this online, and I am generally interested to what extent implicit rendering models are used already. submitted by /u/BreadRollsWithButter [link] [comments]  ( 1 min )
    [N] Upcoming talks for Laplace’s Causal Demon: A Webinar Series about Bayesian Inference and Causal Inference (week of 28 Feb.)
    We have three talks next week (week of 28 Feb.) for our upcoming webinar series about the intersection of Bayesian inference and causal inference. Our speakers will help us understand how we can use these two frameworks in order to solve applied problems, and will consider if these different frameworks are in conflict or are complimentary. The intended audience is machine learning practitioners and statisticians from academia and industry. ​ Upcoming talks, with Zoom registration links: 28 Feb. Christopher Sims - Large Parameter Spaces and Weighted Data: A Bayesian Perspective Bayesian analysis can suggest ignoring sampling weights, even in contexts where popular estimation methods like Horvitz-Thompson or “doubly robust” estimates do use the weights. Weights are sometimes treated in …  ( 2 min )
    [R] Open-Ended Reinforcement Learning with Neural Reward Functions
    Paper Abstract Inspired by the great success of unsupervised learning in Computer Vision and Natural Language Processing, the Reinforcement Learning community has recently started to focus more on unsupervised discovery of skills. Most current approaches, like DIAYN or DADS, optimize some form of mutual information objective. We propose a different approach that uses reward functions encoded by neural networks. These are trained iteratively to reward more complex behavior. In high-dimensional robotic environments our approach learns a wide range of interesting skills including front-flips for Half-Cheetah and one-legged running for Humanoid. In the pixel-based Montezuma’s Revenge environment our method also works with minimal changes and it learns complex skills that involve interacting with items and visiting diverse locations. Unsupervised Skills ​ https://i.redd.it/mbhnagfp2mj81.gif Web-link: https://as.inf.ethz.ch/research/open_ended_RL/main.html arXiv link: https://arxiv.org/abs/2202.08266 submitted by /u/asierm [link] [comments]  ( 1 min )
    [D] Sentiment Analysis, Part 2 — How to choose pre-annotated datasets for Sentiment Analysis?
    Hello here 👋 I just posted the second blog post of a series about Sentiment Analysis: https://medium.com/besedo-engineering/sentiment-analysis-part-2-how-to-choose-pre-annotated-datasets-for-sentiment-analysis-d79736e8c147 This blog post is about how to choose the perfect dataset for your Sentiment Analysis project, which is one of the most important part of a project. I hope you'll like it! 🙌 Any feedback is appreciated 🙏 Bonus: There are still two blog posts on this series! submitted by /u/jade_mlc [link] [comments]  ( 1 min )
    [N] MIT talk by Max Tegmark on the Future of AI:
    https://www.youtube.com/watch?v=IGOuV6UyQ1Q&list=PLhCQIYxdniNuu422uTU91PRHoj3gxF0uD&index=3 submitted by /u/Corddot [link] [comments]
    [D] Comparing latent spaces learned on similar/identical data
    I have a very general question about latent spaces. It seems like there are many different neural network architectures that project input data into some sort of latent space and then make classifications, predictions, or generate new data based on that latent space. My question is, are there any accepted practices or standard methods for comparing latent spaces learned on similar or identical datasets? A trivial example would be for an autoencoder. If you had a single dataset, you could train multiple autoencoder architectures and compare how well the input and output match for each architecture. However, latent spaces exist in lots of different applications outside of autoencoders, and it seems like there might be useful ways to compare them beyond "did this reconstruct the input perfectly". For example, two different text-classification neural networks (NN1 and NN2) with latent spaces of the same dimension might project text samples very differently. It might enable NN1 to classify some samples well and others poorly, while NN2 might perform better on the opposite samples. It seems like it might be useful to understand the similarities and differences between each latent space, or maybe to figure out how one latent space maps onto another. Please let me know if my question isn't clear, or if it's trivial. I did some google-ing but I'm not always sure what terms to use. Thanks! submitted by /u/kinnunenenenen [link] [comments]  ( 3 min )
    [D] - What is the best cloud provider with a budget of £250?
    I am looking for a Cloud provider to run Machine Learning jobs for my final year project. The department gives us a budget of £250, which I was thinking of using all for running cloud jobs. What’s the best cloud provider? Are there any special offers for academics? submitted by /u/ats678 [link] [comments]  ( 1 min )
    [Discussion] How do you Automate your Jupyter Notebooks?
    Hi All! I was wondering what are some of the ways that people use to automate their notebook works? The main challenges I had in mind are version control, monitoring, better collaboration within the team, standardization across teams, and fast cloud deployments. ​ Disclosure, I'm part of the Ploomber team (an open source framework) and it solves most if not all of those problems. submitted by /u/idomic [link] [comments]  ( 1 min )
    [R] Deepmind: A data-driven approach for learning to control computers
    submitted by /u/ReasonablyBadass [link] [comments]  ( 1 min )
    [R] A list of Open source tools and research papers on Data Centric AI
    The Data Centric approach to building AI systems focuses on data instead of code. We thought it would be cool if there was a central repository of all things Data Centric AI, so we set out to build one. We have put together a list of research papers and open-source tools on Data Centric AI, that we think you will find useful. https://mindkosh.com/data-centric-ai/research-papers.html https://mindkosh.com/data-centric-ai/open-source-tools.html submitted by /u/ifcarscouldspeak [link] [comments]  ( 1 min )
    [D] Handling variable number of outputs in an NN
    Hello, I was wondering whether it is possible to construct an architecture that handles a variable number of outputs (from a single input). The scenario is as follows: ​ In the dataset, all samples have 3 output labels. However, some samples might have additional labels besides the first 3 (e.g. 6, 9 or 12) and they are in groups of 3. That is, output labels 4, 5 and 6 are either all present or none are present for a sample. ​ I could not find any literature on these types of problems, so I started thinking of an architecture that could learn this end-to-end. This is my (naive) attempt at that. Naive implementation of variable output network ​ Essentially, the input batch is first passed through a feature extractor (think BERT). Then, the labels which are known to be present in all labels are predicted via individual classification layers. ​ For the groups of labels which are not present in all samples, we first pass the features through individual binary classifiers (the `group x present?` boxes). If the result is positive, i.e. the group of labels is present, we pass on the features to another set of classification layers. As a loss, I was thinking of using binary cross entropy for the `group x present?` boxes, followed by a binary threshold activation, and regular cross entropy for the group classification layers. ​ Getting to my question, do you think this kind of construction makes sense? Did I create a monster? Are you aware of any literature that tackles these kinds of problems? ​ Thank you in advance! submitted by /u/MurlocXYZ [link] [comments]  ( 2 min )
    [D] Is it better to train with fine-grained labels, or more broad, super-labels which encompass the fine-grained ones?
    Lets say you're interested in training a Fruit / Vegetable classifier You could train with general labels of Fruit and Vegetable - OR - You could train with fine-grained labels of Apple, Pear, Broccoli, and Eggplant, which you will convert into Fruit/Vegetable during inference. My intuition is telling me that fine-grained labels will provide more detailed losses, forcing the model to learn a wider range of features. My other intuition is telling me that learning to separate Apple and Pear doesn't assist in separating Fruit and Vegetable. Is there any research, or even your own experience on whether you get better results training a neural network with general classes vs fine-grained classes? submitted by /u/lukeiy [link] [comments]  ( 2 min )
    [D] Universal Approximation of Functions on Sets
    Abs: " Modelling functions of sets, or equivalently, permutation-invariant functions, is a long-standing challenge in machine learning. Deep Sets is a popular method which is known to be a universal approximator for continuous set functions. We provide a theoretical analysis of Deep Sets which shows that this universal approximation property is only guaranteed if the model's latent space is sufficiently high-dimensional. If the latent space is even one dimension lower than necessary, there exist piecewise-affine functions for which Deep Sets performs no better than a naïve constant baseline, as judged by worst-case error. Deep Sets may be viewed as the most efficient incarnation of the Janossy pooling paradigm. We identify this paradigm as encompassing most currently popular set-learning methods. Based on this connection, we discuss the implications of our results for set learning more broadly, and identify some open questions on the universality of Janossy pooling in general. " ​ Arxiv: https://arxiv.org/abs/2107.01959 A YT vid over paper: https://www.youtube.com/watch?v=V9A6S8uGuPs&ab_channel=ZachData submitted by /u/Consistent_Data_7233 [link] [comments]  ( 1 min )
    [P] anomaly detection in time series data
    Hi All! What could be some approaches while dealing with anomaly detection in time series data? Im looking at the data that doesn’t show seasonality as much, it’s the sudden changes compared to previous data points that need to be identified as anomalies. Please suggest methods that i could use. Simple stats, sophisticated models any help is greatly appreciated! submitted by /u/Tasty_Hat1965 [link] [comments]  ( 1 min )
    [N] Who Is Behind QAnon? Linguistic Detectives Find Fingerprints using statistics and machine learning
    According to the New-York Times, using machine learning, stylometry, and statistics on Q texts, two separate teams of NLP researchers from France and Swiss have identified the same two men as likely authors of messages that fueled the QAnon movement. First the initiator, Paul Furber, a South African software developer and then Ron Watkins took over, who operated 8chan website where the Q messages began appearing in 2018 and is now running election for Republican in Arizona. submitted by /u/ClaudeCoulombe [link] [comments]  ( 3 min )
  • Open

    Chatbot
    I don't know too much about neural networks, as I'm only a high school student, however I learned how they work and made a few simple programs. I'm curious, if it's possible to make a chatbot, that would be able to communicate in many topics, using the following logic: there would be a neural network, which would get the user's text as an input and try to categorize it (eg. Movies, Weather, Astronomy, News, etc.) and then the output would be the category. All the categories would have their own neural networks, and for example if the output is "Movies", it would send the user data as an input into the movies neural network, and that neural network would find an appropriate answer. For example the user says "What should I watch?", then the first NN would categorize it as "Movies" and then the movies NN would choose the accurate answer from answers like "I heard Interstellar is an awesome sci-fi" and "I don't like horror movies" etc. Is it possible, and if so, could anyone recommend me tutorials/blogs/sources of building something similar? submitted by /u/Quirky-Temperature18 [link] [comments]  ( 1 min )
    Machine Learning in Scraping with Rails
    submitted by /u/Kagermanov [link] [comments]
    Which AI can you recommend to me, which can generate normal texts that make as much sense as possible, where I can say at certain points that a word belongs in the text.
    submitted by /u/xXLisa28Xx [link] [comments]  ( 1 min )
    Introducing Artificial Intelligence Survey Newsletter
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    EU Artificial Intelligence Act: Risk Levels
    submitted by /u/Philo167 [link] [comments]  ( 1 min )
    Question about true positive and false positive detections in object detection
    Hi all, I am computing true positives, false positives, and false negatives in order to calculate my model's precision and recall. I am using YOLOv5. According to this source and this one, an IoU overlap between the predicted and ground truth bounding box greater than 50% is a true positive detection (we are assuming we set the IoU threshold at 0.5), and an overlap less than 50% is considered to be a false positive. Let's say you have multiple objects in your image that you are detecting. You label one object as "class A", but the model predicts it as "class B." Am I correct that this is also a false positive? As I understand it, there are two ways of getting a false positive: Achieving an IoU overlap of less than 50% of the correctly-predicted class, and identifying any overlap of an incorrect class? submitted by /u/EnvironmentalLemon36 [link] [comments]  ( 1 min )
    Blog on 'Support Vector Machine'
    I published a new blog post on Support Vector Machine which covers intuitional understanding to handling non-linear datapoints (Kernel SVM). ​ Check it out. https://medium.com/@wasimmadha/decision-boundaries-using-support-vector-machines-svms-9554e0e1a5a9 submitted by /u/wasimakil [link] [comments]
    Hey Folks, Here is a really cool research by IBM Researchers where they built a Machine Learning model that can help harness the power of enzymes for Greener Chemistry
    The development of ecologically acceptable biochemical substitutes for industrial processes might be accelerated thanks to nature’s molecular machinery. Enzymes are the master accelerators of nearly every activity in the human body, helping with everything from digestion to the breakdown of hazardous chemicals and even DNA replication. Enzymes’ relevance extends beyond biology; they’re also utilized to make industrial chemical processes more environmentally friendly by reducing energy consumption and the number of harmful solvents needed in their production. The enzyme Xylanase treatment in paper manufacture, for example, has been demonstrated to reduce chlorine usage by 15% and toxic adsorbable organic halides (a chlorine byproduct) by 25% when producing white paper for printing or use in notebooks. You can continue reading this article and tell us your feedback. Github: https://github.com/rxn4chemistry/biocatalysis-model Project: https://rxn.res.ibm.com/ Paper: https://www.nature.com/articles/s41467-022-28536-w.pdf submitted by /u/techsucker [link] [comments]  ( 1 min )
    Lifetime Access to 170+ GPT3 Resources
    Hi Makers, Good day. Here I am with my next product. https://shotfox.gumroad.com/l/gpt-3resources For the past few months, I am working on collecting all the GPT-3 related resources, that inlcludes, tweets, github repos, articles, and much more for my next GPT-3 product idea. By now, the resource count have reached almost 170+ and thought of putting this valuable database to public and here I am. If you are also someone who is admirer of GPT-3 and wanted to know from its basics till where it is used in the current world, this resource database would help you a lot. Have categorized the resources into multiple as below: Articles Code Generator Content Creation Design Fun Ideas Github Repos GPT3 Community Ideas Notable Takes Products Reasoning Social Media Marketing Text processing Tutorial Utilities Website Builder submitted by /u/bhaskar2191 [link] [comments]  ( 1 min )
    A new study finds that in a remote environment, an artificial intelligence (AI) tutoring system can outperform expert human instructors.
    submitted by /u/qptbook [link] [comments]
    coursera stanford course
    Hello guys. I was looking into learning AI. I googled online courses and I came across this stanford course https://www.coursera.org/learn/machine-learning . Is this a good way to introduce myself into the field? Also at the end, i can pay money to get certified, is that certificate worth the money i will pay? thank you for helping submitted by /u/playboicairoo [link] [comments]  ( 1 min )
    Deepfakes Can Effectively Fool Many Major Facial ‘Liveness’ APIs
    submitted by /u/DaveBowman1975 [link] [comments]
    Explore MLIR Development Process
    submitted by /u/Just0by [link] [comments]
    DeepMind Trains Agents to Control Computers as Humans Do to Solve Everyday Tasks
    submitted by /u/nick7566 [link] [comments]  ( 1 min )
    Animate Your Pictures Realistically With AI!
    submitted by /u/OnlyProggingForFun [link] [comments]
    H2O.ai Democratizes State-of-the-Art Deep Learning for Data Scientists and Developers with H2O Hydrogen Torch
    Application development that combines sophisticated technologies such as AI and ML is progressing, this time by combining multiple deployment outcomes into a single general-purpose, no-code platform. Line-of-business users may easily transition from studying data records to natural language processing, image, and video outputs in this manner. Because so many software developers can only focus on one of those outputs at a time, this type of adaptability for non-IT personnel hasn’t been offered on the market. Market heavyweights like DataRobot, Amazon Web Services, Microsoft, DataBricks, and SAS don’t provide this particular feature. H2O.ai, on the other hand, has set out to overcome this problem. The company is situated in Mountain View, California. H2O.ai has announced H2O Hydrogen Torch, a significant new addition to its open-source platform. This feature is a deep-learning training engine that, according to the company, makes it easy for every size and area of the company to generate a cutting-edge image, video, and natural language processing (NLP) models without scripting. These models may be utilized in the field to uncover fresh business insights about consumers, rivals, the market, and other topics. CONTINUE READING https://h2o.ai/platform/h2o-hydrogen-torch/ submitted by /u/techsucker [link] [comments]  ( 1 min )
  • Open

    Is reinforcement learning a scam?
    It promises us to create agents, basically to replace people forever. But it only works for games LMAO, well maybe some game developer will sit down and code a plausible simulation of REALITY. Thoughts? submitted by /u/JewishRevenge1939 [link] [comments]  ( 1 min )
  • Open

    "Yann LeCun on a vision to make AI systems learn and reason like animals and humans" (sketching an AGI arch using self-supervised learning)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Do you know of any companies that are applying reinforcement learning in production in agriculture or healthcare?
    I have come across many academic papers that propose solutions in both industries but have not found many companies that use RL as part of their core product - any specific companies would be much appreciated! submitted by /u/notabot789 [link] [comments]  ( 1 min )
    Viable algorithms for multiple discrete actions
    I am currently considering applying reinforcement learning to dynamically adjust prices for a tourist attraction. There are eight different timeslots which all have different amounts of tickets available. What I would like to do is to utilize an agent that will assign daily every individual time slot one of 5 different price categories. The agent would also look at external factors such as amount of tickets left per slot, current prices, demand forecast, days until arrival, etc. Meanwhile the reward would be the accumulated revenue. Formulating the action space into a single action at a time would result in 510 amount of possible different actions. I have considered two alternative approaches for this Have the agent have the following actions: move up slot, move down slot, increase price, decrease price, done adjusting prices for the day. Have an agent output multiple actions where each action is allocating a given slot one of five different pricing categories. My guess is that option two is better suited for this but I am not too familiar which which algorithms might provide the functionality that I need. If you can point in the right directions or if there is something that I might not be considering, any help would be appreciated. submitted by /u/TheSurvivingHalf [link] [comments]  ( 2 min )
    What do bias and variance terms mean in RL Algorithms w.r.t training /testing errors . Does high variance means simply high testing error .
    What are the main disadvantage due to high variance/bias ? submitted by /u/aabra__ka__daabra [link] [comments]  ( 1 min )
  • Open

    4D-Net: Learning Multi-Modal Alignment for 3D and Image Inputs in Time
    Posted by AJ Piergiovanni and Anelia Angelova, Research Scientists, Google Research While not immediately obvious, all of us experience the world in four dimensions (4D). For example, when walking or driving down the street we observe a stream of visual inputs, snapshots of the 3D world, which, when taken together in time, creates a 4D visual input. Today’s autonomous vehicles and robots are able to capture much of this information through various onboard sensing mechanisms, such as LiDAR and cameras. LiDAR is a ubiquitous sensor that uses light pulses to reliably measure the 3D coordinates of objects in a scene, however, it is also sparse and has a limited range — the farther one is from a sensor, the fewer points will be returned. This means that far-away objects might only get a hand…  ( 8 min )
  • Open

    COMPASS: COntrastive Multimodal Pretraining for AutonomouS Systems
    Humans have the fundamental cognitive ability to perceive the environment through multimodal sensory signals and utilize this to accomplish a wide variety of tasks. It is crucial that an autonomous agent can similarly perceive the underlying state of an environment from different sensors and appropriately consider how to accomplish a task. For example, localization (or […] The post COMPASS: COntrastive Multimodal Pretraining for AutonomouS Systems appeared first on Microsoft Research.  ( 11 min )
  • Open

    Meet the Omnivore: 3D Creator Makes Fine Art for Digital Era Inspired by Silk Road Masterpieces
    Within the Mogao Caves, a cultural crossroads along what was the Silk Road in northwestern China, lies a natural reserve of tens of thousands of historical documents, paintings and statues of the Buddha.  The post Meet the Omnivore: 3D Creator Makes Fine Art for Digital Era Inspired by Silk Road Masterpieces appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Unsupervised Skill Discovery with Contrastive Intrinsic Control
    Unsupervised Reinforcement Learning (RL), where RL agents pre-train with self-supervised rewards, is an emerging paradigm for developing RL agents that are capable of generalization. Recently, we released the Unsupervised RL Benchmark (URLB) which we covered in a previous post. URLB benchmarked many unsupervised RL algorithms across three categories — competence-based, knowledge-based, and data-based algorithms. A surprising finding was that competence-based algorithms significantly underperformed other categories. In this post we will demystify what has been holding back competence-based methods and introduce Contrastive Intrinsic Control (CIC), a new competence-based algorithm that is the first to achieve leading results on URLB. Results from benchmarking unsupervised RL algorithms To re…  ( 3 min )
  • Open

    How Good are Low-Rank Approximations in Gaussian Process Regression?. (arXiv:2112.06410v3 [stat.ML] UPDATED)
    We provide guarantees for approximate Gaussian Process (GP) regression resulting from two common low-rank kernel approximations: based on random Fourier features, and based on truncating the kernel's Mercer expansion. In particular, we bound the Kullback-Leibler divergence between an exact GP and one resulting from one of the afore-described low-rank approximations to its kernel, as well as between their corresponding predictive densities, and we also bound the error between predictive mean vectors and between predictive covariance matrices computed using the exact versus using the approximate GP. We provide experiments on both simulated data and standard benchmarks to evaluate the effectiveness of our theoretical bounds.  ( 2 min )
    Embedding Decomposition for Artifacts Removal in EEG Signals. (arXiv:2112.00989v2 [cs.LG] UPDATED)
    Electroencephalogram (EEG) recordings are often contaminated with artifacts. Various methods have been developed to eliminate or weaken the influence of artifacts. However, most of them rely on prior experience for analysis. Here, we propose an deep learning framework to separate neural signal and artifacts in the embedding space and reconstruct the denoised signal, which is called DeepSeparator. DeepSeparator employs an encoder to extract and amplify the features in the raw EEG, a module called decomposer to extract the trend, detect and suppress artifact and a decoder to reconstruct the denoised signal. Besides, DeepSeparator can extract the artifact, which largely increases the model interpretability. The proposed method is tested with a semi-synthetic EEG dataset and a real task-related EEG dataset, suggesting that DeepSeparator outperforms the conventional models in both EOG and EMG artifact removal. DeepSeparator can be extended to multi-channel EEG and data of any length. It may motivate future developments and application of deep learning-based EEG denoising. The code for DeepSeparator is available at https://github.com/ncclabsustech/DeepSeparator.  ( 2 min )
    On out-of-distribution detection with Bayesian neural networks. (arXiv:2110.06020v2 [cs.LG] UPDATED)
    The question whether inputs are valid for the problem a neural network is trying to solve has sparked interest in out-of-distribution (OOD) detection. It is widely assumed that Bayesian neural networks (BNNs) are well suited for this task, as the endowed epistemic uncertainty should lead to disagreement in predictions on outliers. In this paper, we question this assumption and show that proper Bayesian inference with function space priors induced by neural networks does not necessarily lead to good OOD detection. To circumvent the use of approximate inference, we start by studying the infinite-width case, where Bayesian inference can be exact due to the correspondence with Gaussian processes. Strikingly, the kernels derived from common architectural choices lead to function space priors which induce predictive uncertainties that do not reflect the underlying input data distribution and are therefore unsuited for OOD detection. Importantly, we find the OOD behavior in this limiting case to be consistent with the corresponding finite-width case. To overcome this limitation, useful function space properties can also be encoded in the prior in weight space, however, this can currently only be applied to a specified subset of the domain and thus does not inherently extend to OOD data. Finally, we argue that a trade-off between generalization and OOD capabilities might render the application of BNNs for OOD detection undesirable in practice. Overall, our study discloses fundamental problems when naively using BNNs for OOD detection and opens interesting avenues for future research.  ( 2 min )
    EDITS: Modeling and Mitigating Data Bias for Graph Neural Networks. (arXiv:2108.05233v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have shown superior performance in analyzing attributed networks in various web-based applications such as social recommendation and web search. Nevertheless, in high-stake decision-making scenarios such as online fraud detection, there is an increasing societal concern that GNNs could make discriminatory decisions towards certain demographic groups. Despite recent explorations on fair GNNs, these works are tailored for a specific GNN model. However, myriads of GNN variants have been proposed for different applications, and it is costly to fine-tune existing debiasing algorithms for each specific GNN architecture. Different from existing works that debias GNN models, we aim to debias the input attributed network to achieve fairer GNNs through feeding GNNs with less biased data. Specifically, we propose novel definitions and metrics to measure the bias in an attributed network, which leads to the optimization objective to mitigate bias. We then develop a framework EDITS to mitigate the bias in attributed networks while maintaining the performance of GNNs in downstream tasks. EDITS works in a model-agnostic manner, i.e., it is independent of any specific GNN. Experiments demonstrate the validity of the proposed bias metrics and the superiority of EDITS on both bias mitigation and utility maintenance. Open-source implementation: https://github.com/yushundong/EDITS.  ( 2 min )
    Learning Filterbanks for End-to-End Acoustic Beamforming. (arXiv:2111.04614v2 [eess.AS] UPDATED)
    Recent work on monaural source separation has shown that performance can be increased by using fully learned filterbanks with short windows. On the other hand it is widely known that, for conventional beamforming techniques, performance increases with long analysis windows. This applies also to most hybrid neural beamforming methods which rely on a deep neural network (DNN) to estimate the spatial covariance matrices. In this work we try to bridge the gap between these two worlds and explore fully end-to-end hybrid neural beamforming in which, instead of using the Short-Time-Fourier Transform, also the analysis and synthesis filterbanks are learnt jointly with the DNN. In detail, we explore two different types of learned filterbanks: fully learned and analytic. We perform a detailed analysis using the recent Clarity Challenge data and show that by using learnt filterbanks it is possible to surpass oracle-mask based beamforming for short windows.  ( 2 min )
    DeepHAM: A Global Solution Method for Heterogeneous Agent Models with Aggregate Shocks. (arXiv:2112.14377v2 [econ.GN] UPDATED)
    An efficient, reliable, and interpretable global solution method, the Deep learning-based algorithm for Heterogeneous Agent Models (DeepHAM), is proposed for solving high dimensional heterogeneous agent models with aggregate shocks. The state distribution is approximately represented by a set of optimal generalized moments. Deep neural networks are used to approximate the value and policy functions, and the objective is optimized over directly simulated paths. In addition to being an accurate global solver, this method has three additional features. First, it is computationally efficient in solving complex heterogeneous agent models, and it does not suffer from the curse of dimensionality. Second, it provides a general and interpretable representation of the distribution over individual states, which is crucial in addressing the classical question of whether and how heterogeneity matters in macroeconomics. Third, it solves the constrained efficiency problem as easily as it solves the competitive equilibrium, which opens up new possibilities for studying optimal monetary and fiscal policies in heterogeneous agent models with aggregate shocks.  ( 2 min )
    Urban Fire Station Location Planning using Predicted Demand and Service Quality Index. (arXiv:2109.02160v2 [cs.LG] UPDATED)
    In this article, we propose a systematic approach for fire station location planning. We develop machine learning models, based on Random Forest and Extreme Gradient Boosting, for demand prediction and utilize the models further to define a generalized index to measure quality of fire service in urban settings. Our model is built upon spatial data collected from multiple different sources. Efficacy of proper facility planning depends on choice of candidates where fire stations can be located along with existing stations, if any. Also, the travel time from these candidates to demand locations need to be taken care of to maintain fire safety standard. Here, we propose a travel time based clustering technique to identify suitable candidates. Finally, we develop an optimization problem to select best locations to install new fire stations. Our optimization problem is built upon maximum coverage problem, based on integer programming. We further develop a two-stage stochastic optimization model to characterize the confidence in our decision outcome. We present a detailed experimental study of our proposed approach in collaboration with city of Victoria Fire Department, MN, USA. Our demand prediction model achieves true positive rate of 80% and false positive rate of 20% approximately. We aid Victoria Fire Department to select a location for a new fire station using our approach. We present detailed results on improvement statistics by locating a new facility, as suggested by our methodology, in the city of Victoria.  ( 2 min )
    Trap of Feature Diversity in the Learning of MLPs. (arXiv:2112.00980v2 [cs.LG] UPDATED)
    In this paper, we discover a two-phase phenomenon in the learning of multi-layer perceptrons (MLPs). I.e., in the first phase, the training loss does not decrease significantly, but the similarity of features between different samples keeps increasing, which hurts the feature diversity. We explain such a two-phase phenomenon in terms of the learning dynamics of the MLP. Furthermore, we propose two normalization operations to eliminate the two-phase phenomenon, which avoids the decrease of the feature diversity and speeds up the training process.  ( 2 min )
    ImportantAug: a data augmentation agent for speech. (arXiv:2112.07156v2 [eess.AS] UPDATED)
    We introduce ImportantAug, a technique to augment training data for speech classification and recognition models by adding noise to unimportant regions of the speech and not to important regions. Importance is predicted for each utterance by a data augmentation agent that is trained to maximize the amount of noise it adds while minimizing its impact on recognition performance. The effectiveness of our method is illustrated on version two of the Google Speech Commands (GSC) dataset. On the standard GSC test set, it achieves a 23.3% relative error rate reduction compared to conventional noise augmentation which applies noise to speech without regard to where it might be most effective. It also provides a 25.4% error rate reduction compared to a baseline without data augmentation. Additionally, the proposed ImportantAug outperforms the conventional noise augmentation and the baseline on two test sets with additional noise added.  ( 2 min )
    Generalized Bayesian Upper Confidence Bound with Approximate Inference for Bandit Problems. (arXiv:2201.12955v2 [cs.LG] UPDATED)
    Bayesian bandit algorithms with approximate inference have been widely used in practice with superior performance. Yet, few studies regarding the fundamental understanding of their performances are available. In this paper, we propose a Bayesian bandit algorithm, which we call Generalized Bayesian Upper Confidence Bound (GBUCB), for bandit problems in the presence of approximate inference. Our theoretical analysis demonstrates that in Bernoulli multi-armed bandit, GBUCB can achieve $O(\sqrt{T}(\log T)^c)$ frequentist regret if the inference error measured by symmetrized Kullback-Leibler divergence is controllable. This analysis relies on a novel sensitivity analysis for quantile shifts with respect to inference errors. To our best knowledge, our work provides the first theoretical regret bound that is better than $o(T)$ in the setting of approximate inference. Our experimental evaluations on multiple approximate inference settings corroborate our theory, showing that our GBUCB is consistently superior to BUCB and Thompson sampling.  ( 2 min )
    A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits. (arXiv:2110.11819v2 [cs.LG] UPDATED)
    Motivated by the fact that humans like some level of unpredictability or novelty, and might therefore get quickly bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function. This enables the modeling of phenomena such as progressive satiation and periodic behaviours. Building upon the Combinatorial Semi-Bandits (CSB) framework, we design an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy (which is NP-hard to compute). Similarly to previous works, our regret analysis is based on defining and solving an appropriate trade-off between approximation and estimation. Preliminary experiments confirm the superiority of our algorithm over both the oracle greedy approach and a vanilla CSB solver.  ( 2 min )
    CAFE: Catastrophic Data Leakage in Vertical Federated Learning. (arXiv:2110.15122v4 [cs.LG] UPDATED)
    Recent studies show that private training data can be leaked through the gradients sharing mechanism deployed in distributed machine learning systems, such as federated learning (FL). Increasing batch size to complicate data recovery is often viewed as a promising defense strategy against data leakage. In this paper, we revisit this defense premise and propose an advanced data leakage attack with theoretical justification to efficiently recover batch data from the shared aggregated gradients. We name our proposed method as catastrophic data leakage in vertical federated learning (CAFE). Comparing to existing data leakage attacks, our extensive experimental results on vertical FL settings demonstrate the effectiveness of CAFE to perform large-batch data leakage attack with improved data recovery quality. We also propose a practical countermeasure to mitigate CAFE. Our results suggest that private data participated in standard FL, especially the vertical case, have a high risk of being leaked from the training gradients. Our analysis implies unprecedented and practical data leakage risks in those learning settings. The code of our work is available at https://github.com/DeRafael/CAFE.  ( 2 min )
    Bit-wise Training of Neural Network Weights. (arXiv:2202.09571v1 [cs.LG])
    We introduce an algorithm where the individual bits representing the weights of a neural network are learned. This method allows training weights with integer values on arbitrary bit-depths and naturally uncovers sparse networks, without additional constraints or regularization techniques. We show better results than the standard training technique with fully connected networks and similar performance as compared to standard training for convolutional and residual networks. By training bits in a selective manner we found that the biggest contribution to achieving high accuracy is given by the first three most significant bits, while the rest provide an intrinsic regularization. As a consequence more than 90\% of a network can be used to store arbitrary codes without affecting its accuracy. These codes may be random noise, binary files or even the weights of previously trained networks.  ( 2 min )
    Adaptive Gaussian Processes on Graphs via Spectral Graph Wavelets. (arXiv:2110.12752v2 [cs.LG] UPDATED)
    Graph-based models require aggregating information in the graph from neighbourhoods of different sizes. In particular, when the data exhibit varying levels of smoothness on the graph, a multi-scale approach is required to capture the relevant information. In this work, we propose a Gaussian process model using spectral graph wavelets, which can naturally aggregate neighbourhood information at different scales. Through maximum likelihood optimisation of the model hyperparameters, the wavelets automatically adapt to the different frequencies in the data, and as a result our model goes beyond capturing low frequency information. We achieve scalability to larger graphs by using a spectrum-adaptive polynomial approximation of the filter function, which is designed to yield a low approximation error in dense areas of the graph spectrum. Synthetic and real-world experiments demonstrate the ability of our model to infer scales accurately and produce competitive performances against state-of-the-art models in graph-based learning tasks.  ( 2 min )
    Towards Fine-Grained Reasoning for Fake News Detection. (arXiv:2110.15064v3 [cs.CL] UPDATED)
    The detection of fake news often requires sophisticated reasoning skills, such as logically combining information by considering word-level subtle clues. In this paper, we move towards fine-grained reasoning for fake news detection by better reflecting the logical processes of human thinking and enabling the modeling of subtle clues. In particular, we propose a fine-grained reasoning framework by following the human information-processing model, introduce a mutual-reinforcement-based method for incorporating human knowledge about which evidence is more important, and design a prior-aware bi-channel kernel graph network to model subtle differences between pieces of evidence. Extensive experiments show that our model outperforms the state-of-the-art methods and demonstrate the explainability of our approach.  ( 2 min )
    RELAX: Representation Learning Explainability. (arXiv:2112.10161v2 [stat.ML] UPDATED)
    Despite the significant improvements that representation learning via self-supervision has led to when learning from unlabeled data, no methods exist that explain what influences the learned representation. We address this need through our proposed approach, RELAX, which is the first approach for attribution-based explanations of representations. Our approach can also model the uncertainty in its explanations, which is essential to produce trustworthy explanations. RELAX explains representations by measuring similarities in the representation space between an input and masked out versions of itself, providing intuitive explanations and significantly outperforming the gradient-based baseline. We provide theoretical interpretations of RELAX and conduct a novel analysis of feature extractors trained using supervised and unsupervised learning, providing insights into different learning strategies. Finally, we illustrate the usability of RELAX in multi-view clustering and highlight that incorporating uncertainty can be essential for providing low-complexity explanations, taking a crucial step towards explaining representations.  ( 2 min )
    Fair Representation: Guaranteeing Approximate Multiple Group Fairness for Unknown Tasks. (arXiv:2109.00545v2 [cs.LG] UPDATED)
    Motivated by scenarios where data is used for diverse prediction tasks, we study whether fair representation can be used to guarantee fairness for unknown tasks and for multiple fairness notions simultaneously. We consider seven group fairness notions that cover the concepts of independence, separation, and calibration. Against the backdrop of the fairness impossibility results, we explore approximate fairness. We prove that, although fair representation might not guarantee fairness for all prediction tasks, it does guarantee fairness for an important subset of tasks -- the tasks for which the representation is discriminative. Specifically, all seven group fairness notions are linearly controlled by fairness and discriminativeness of the representation. When an incompatibility exists between different fairness notions, fair and discriminative representation hits the sweet spot that approximately satisfies all notions. Motivated by our theoretical findings, we propose to learn both fair and discriminative representations using pretext loss which self-supervises learning, and Maximum Mean Discrepancy as a fair regularizer. Experiments on tabular, image, and face datasets show that using the learned representation, downstream predictions that we are unaware of when learning the representation indeed become fairer for seven group fairness notions, and the fairness guarantees computed from our theoretical results are all valid.  ( 2 min )
    Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems. (arXiv:2201.01680v2 [cs.LG] UPDATED)
    This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems. We consider smoothly parametrized instances and provide an understanding of when logarithmic regret is impossible which is both instance specific and flexible enough to take problem structure into account. This understanding relies on two key notions: That of local-uninformativeness; when the optimal policy does not provide sufficient excitation for identification of the optimal policy, and yields a degenerate Fisher information matrix; and that of information-regret-boundedness, when the small eigenvalues of a policy-dependent information matrix are boundable in terms of the regret of that policy. Combined with a reduction to Bayesian estimation and application of Van Trees' inequality, these two conditions are sufficient for proving regret bounds on order of magnitude $\sqrt{T}$ in the time horizon, $T$. This method yields lower bounds that exhibit tight dimensional dependencies and scale naturally with control-theoretic problem constants. For instance, we are able to prove that systems operating near marginal stability are fundamentally hard to learn to control. We further show that large classes of systems satisfy these conditions, among them any state-feedback system with both $A$- and $B$-matrices unknown. Most importantly, we also establish that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems. Finally, we turn to two simple examples which demonstrate that our lower bound captures classical control-theoretic intuition: our lower bounds diverge for systems operating near marginal stability or with large filter gain -- these can be arbitrarily hard to (learn to) control.  ( 2 min )
    PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior. (arXiv:2106.06406v2 [stat.ML] UPDATED)
    Denoising diffusion probabilistic models have been recently proposed to generate high-quality samples by estimating the gradient of the data density. The framework defines the prior noise as a standard Gaussian distribution, whereas the corresponding data distribution may be more complicated than the standard Gaussian distribution, which potentially introduces inefficiency in denoising the prior noise into the data sample because of the discrepancy between the data and the prior. In this paper, we propose PriorGrad to improve the efficiency of the conditional diffusion model for speech synthesis (for example, a vocoder using a mel-spectrogram as the condition) by applying an adaptive prior derived from the data statistics based on the conditional information. We formulate the training and sampling procedures of PriorGrad and demonstrate the advantages of an adaptive prior through a theoretical analysis. Focusing on the speech synthesis domain, we consider the recently proposed diffusion-based speech generative models based on both the spectral and time domains and show that PriorGrad achieves faster convergence and inference with superior performance, leading to an improved perceptual quality and robustness to a smaller network capacity, and thereby demonstrating the efficiency of a data-dependent adaptive prior.  ( 2 min )
    Generative Adversarial Networks for Labelled Vibration Data Generation. (arXiv:2112.08195v1 [cs.LG] CROSS LISTED)
    As Structural Health Monitoring (SHM) being implemented more over the years, the use of operational modal analysis of civil structures has become more significant for the assessment and evaluation of engineering structures. Machine Learning (ML) and Deep Learning (DL) algorithms have been in use for structural damage diagnostics of civil structures in the last couple of decades. While collecting vibration data from civil structures is a challenging and expensive task for both undamaged and damaged cases, in this paper, the authors are introducing Generative Adversarial Networks (GAN) that is built on the Deep Convolutional Neural Network (DCNN) and using Wasserstein Distance for generating artificial labelled data to be used for structural damage diagnostic purposes. The authors named the developed model 1D W-DCGAN and successfully generated vibration data which is very similar to the input. The methodology presented in this paper will pave the way for vibration data generation for numerous future applications in the SHM domain.  ( 2 min )
    A Variational Bayesian Approach to Learning Latent Variables for Acoustic Knowledge Transfer. (arXiv:2110.08598v2 [eess.AS] UPDATED)
    We propose a variational Bayesian (VB) approach to learning distributions of latent variables in deep neural network (DNN) models for cross-domain knowledge transfer, to address acoustic mismatches between training and testing conditions. Instead of carrying out point estimation in conventional maximum a posteriori estimation with a risk of having a curse of dimensionality in estimating a huge number of model parameters, we focus our attention on estimating a manageable number of latent variables of DNNs via a VB inference framework. To accomplish model transfer, knowledge learnt from a source domain is encoded in prior distributions of latent variables and optimally combined, in a Bayesian sense, with a small set of adaptation data from a target domain to approximate the corresponding posterior distributions. Experimental results on device adaptation in acoustic scene classification show that our proposed VB approach can obtain good improvements on target devices, and consistently outperforms 13 state-of-the-art knowledge transfer algorithms.  ( 2 min )
    Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition. (arXiv:2110.05354v3 [cs.CL] UPDATED)
    Text-only adaptation of an end-to-end (E2E) model remains a challenging task for automatic speech recognition (ASR). Language model (LM) fusion-based approaches require an additional external LM during inference, significantly increasing the computation cost. To overcome this, we propose an internal LM adaptation (ILMA) of the E2E model using text-only data. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the token sequence probability which is approximated by the E2E model output after zeroing out the encoder contribution. During ILMA, we fine-tune the internal LM, i.e., the E2E components excluding the encoder, to minimize a cross-entropy loss. To make ILMA effective, it is essential to train the E2E model with an internal LM loss besides the standard E2E loss. Furthermore, we propose to regularize ILMA by minimizing the Kullback-Leibler divergence between the output distributions of the adapted and unadapted internal LMs. ILMA is the most effective when we update only the last linear layer of the joint network. ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost. Experimented with 30K-hour trained transformer transducer models, ILMA achieves up to 34.9% relative word error rate reduction from the unadapted baseline.  ( 2 min )
    Generative Adversarial Networks for Labeled Data Creation for Structural Monitoring and Damage Detection. (arXiv:2112.03478v2 [cs.LG] CROSS LISTED)
    There has been a drastic progression in the field of Data Science in the last few decades and other disciplines have been continuously benefitting from it. Structural Health Monitoring (SHM) is one of those fields that use Artificial Intelligence (AI) such as Machine Learning (ML) and Deep Learning (DL) algorithms for condition assessment of civil structures based on the collected data. The ML and DL methods require plenty of data for training procedures; however, in SHM, data collection from civil structures is very exhaustive; particularly getting useful data (damage associated data) can be very challenging. This paper uses 1-D Wasserstein Deep Convolutional Generative Adversarial Networks using Gradient Penalty (1-D WDCGAN-GP) for synthetic labeled vibration data generation. Then, implements structural damage detection on different levels of synthetically enhanced vibration datasets by using 1-D Deep Convolutional Neural Network (1-D DCNN). The damage detection results show that the 1-D WDCGAN-GP can be successfully utilized to tackle data scarcity in vibration-based damage diagnostics of civil structures. Keywords: Structural Health Monitoring (SHM), Structural Damage Diagnostics, Structural Damage Detection, 1-D Deep Convolutional Neural Networks (1-D DCNN), 1-D Generative Adversarial Networks (1-D GAN), Deep Convolutional Generative Adversarial Networks (DCGAN), Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP)  ( 2 min )
    DKM: Differentiable K-Means Clustering Layer for Neural Network Compression. (arXiv:2108.12659v4 [cs.LG] UPDATED)
    Deep neural network (DNN) model compression for efficient on-device inference is becoming increasingly important to reduce memory requirements and keep user data on-device. To this end, we propose a novel differentiable k-means clustering layer (DKM) and its application to train-time weight clustering-based DNN model compression. DKM casts k-means clustering as an attention problem and enables joint optimization of the DNN parameters and clustering centroids. Unlike prior works that rely on additional regularizers and parameters, DKM-based compression keeps the original loss function and model architecture fixed. We evaluated DKM-based compression on various DNN models for computer vision and natural language processing (NLP) tasks. Our results demonstrate that DKM delivers superior compression and accuracy trade-off on ImageNet1k and GLUE benchmarks. For example, DKM-based compression can offer 74.5% top-1 ImageNet1k accuracy on ResNet50 DNN model with 3.3MB model size (29.4x model compression factor). For MobileNet-v1, which is a challenging DNN to compress, DKM delivers 63.9% top-1 ImageNet1k accuracy with 0.72 MB model size (22.4x model compression factor). This result is 6.8% higher top-1accuracy and 33% relatively smaller model size than the current state-of-the-art DNN compression algorithms. Additionally, DKM enables compression of DistilBERT model by 11.8x with minimal (1.1%) accuracy loss on GLUE NLP benchmarks.  ( 2 min )
    Echofilter: A Deep Learning Segmentation Model Improves the Automation, Standardization, and Timeliness for Post-Processing Echosounder Data in Tidal Energy Streams. (arXiv:2202.09648v1 [cs.LG])
    Understanding the abundance and distribution of fish in tidal energy streams is important for assessing the risk presented by the introduction of tidal energy devices into the habitat. However, the impressive tidal currents that make sites favorable for tidal energy development are often highly turbulent and entrain air into the water, complicating the interpretation of echosounder data. The portion of the water column contaminated by returns from entrained air must be excluded from data used for biological analyses. Application of a single algorithm to identify the depth-of-penetration of entrained-air is insufficient for a boundary that is discontinuous, depth-dynamic, porous, and widely variable across the tidal flow speeds which can range from 0 to 5m/s. Using a case study at a tidal energy demonstration site in the Bay of Fundy, we describe the development and application of deep learning models that produce a pronounced, consistent, substantial, and measurable improvement of the automated detection of the extent to which entrained-air has penetrated the water column. Our model, Echofilter, was highly responsive to the dynamic range of turbulence conditions and sensitive to the fine-scale nuances in the boundary position, producing an entrained-air boundary line with an average error of 0.32m on mobile downfacing and 0.5-1.0m on stationary upfacing data. The model's annotations had a high level of agreement with the human segmentation (mobile downfacing Jaccard index: 98.8%; stationary upfacing: 93-95%). This resulted in a 50% reduction in the time required for manual edits compared to the time required to manually edit the line placed by currently available algorithms. Because of the improved initial automated placement, the implementation of the models generated a marked increase in the standardization and repeatability of line placement.  ( 2 min )
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning. (arXiv:2110.13214v3 [cs.CV] UPDATED)
    Current visual question answering (VQA) tasks mainly consider answering human-annotated questions for natural images. However, aside from natural images, abstract diagrams with semantic richness are still understudied in visual understanding and reasoning research. In this work, we introduce a new challenge of Icon Question Answering (IconQA) with the goal of answering a question in an icon image context. We release IconQA, a large-scale dataset that consists of 107,439 questions and three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. The IconQA dataset is inspired by real-world diagram word problems that highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning. Thus, IconQA requires not only perception skills like object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning. To facilitate potential IconQA models to learn semantic representations for icon images, we further release an icon dataset Icon645 which contains 645,687 colored icons on 377 classes. We conduct extensive user studies and blind experiments and reproduce a wide range of advanced VQA methods to benchmark the IconQA task. Also, we develop a strong IconQA baseline Patch-TRM that applies a pyramid cross-modal Transformer with input diagram embeddings pre-trained on the icon dataset. IconQA and Icon645 are available at https://iconqa.github.io.  ( 2 min )
    SHARP: Shielding-Aware Robust Planning for Safe and Efficient Human-Robot Interaction. (arXiv:2110.00843v2 [cs.RO] UPDATED)
    Jointly achieving safety and efficiency in human-robot interaction (HRI) settings is a challenging problem, as the robot's planning objectives may be at odds with the human's own intent and expectations. Recent approaches ensure safe robot operation in uncertain environments through a supervisory control scheme, sometimes called "shielding", which overrides the robot's nominal plan with a safety fallback strategy when a safety-critical event is imminent. These reactive "last-resort" strategies (typically in the form of aggressive emergency maneuvers) focus on preserving safety without efficiency considerations; when the nominal planner is unaware of possible safety overrides, shielding can be activated more frequently than necessary, leading to degraded performance. In this work, we propose a new shielding-based planning approach that allows the robot to plan efficiently by explicitly accounting for possible future shielding events. Leveraging recent work on Bayesian human motion prediction, the resulting robot policy proactively balances nominal performance with the risk of high-cost emergency maneuvers triggered by low-probability human behaviors. We formalize Shielding-Aware Robust Planning (SHARP) as a stochastic optimal control problem and propose a computationally efficient framework for finding tractable approximate solutions at runtime. Our method outperforms the shielding-agnostic motion planning baseline (equipped with the same human intent inference scheme) on simulated driving examples with human trajectories taken from the recently released Waymo Open Motion Dataset.  ( 2 min )
    A Survey of Large-Scale Deep Learning Serving System Optimization: Challenges and Opportunities. (arXiv:2111.14247v2 [cs.LG] UPDATED)
    Deep Learning (DL) models have achieved superior performance in many application domains, including vision, language, medical, commercial ads, entertainment, etc. With the fast development, both DL applications and the underlying serving hardware have demonstrated strong scaling trends, i.e., Model Scaling and Compute Scaling, for example, the recent pre-trained model with hundreds of billions of parameters with ~TB level memory consumption, as well as the newest GPU accelerators providing hundreds of TFLOPS. With both scaling trends, new problems and challenges emerge in DL inference serving systems, which gradually trends towards Large-scale Deep learning Serving systems (LDS). This survey aims to summarize and categorize the emerging challenges and optimization opportunities for large-scale deep learning serving systems. By providing a novel taxonomy, summarizing the computing paradigms, and elaborating the recent technique advances, we hope that this survey could shed light on new optimization perspectives and motivate novel works in large-scale deep learning system optimization.  ( 2 min )
    Internal Wasserstein Distance for Adversarial Attack and Defense. (arXiv:2103.07598v2 [cs.LG] UPDATED)
    Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks that would trigger misclassification of DNNs but may be imperceptible to human perception. Adversarial defense has been important ways to improve the robustness of DNNs. Existing attack methods often construct adversarial examples relying on some metrics like the $\ell_p$ distance to perturb samples. However, these metrics can be insufficient to conduct adversarial attacks due to their limited perturbations. In this paper, we propose a new internal Wasserstein distance (IWD) to capture the semantic similarity of two samples, and thus it helps to obtain larger perturbations than currently used metrics such as the $\ell_p$ distance We then apply the internal Wasserstein distance to perform adversarial attack and defense. In particular, we develop a novel attack method relying on IWD to calculate the similarities between an image and its adversarial examples. In this way, we can generate diverse and semantically similar adversarial examples that are more difficult to defend by existing defense methods. Moreover, we devise a new defense method relying on IWD to learn robust models against unseen adversarial examples. We provide both thorough theoretical and empirical evidence to support our methods.  ( 2 min )
    Fair Conformal Predictors for Applications in Medical Imaging. (arXiv:2109.04392v2 [eess.IV] UPDATED)
    Deep learning has the potential to augment several clinically useful aspects of the radiologist's workflow such as medical imaging interpretation. However, the translation of deep learning algorithms into clinical practice has been hindered by relative lack of transparency in these algorithms compared to more traditional statistical methods. Specifically, common deep learning models lack intuitive and rigorous methods of conveying prediction confidence in a calibrated manner, which ultimately restricts widespread use of these "black box" systems for critical decision-making. Furthermore, numerous demonstrations of algorithmic bias in clinical machine learning have caused considerable hesitancy towards the deployment of these models for clinical application. To this end, we explore how conformal predictions can complement existing deep learning approaches by providing an intuitive way of expressing model uncertainty to facilitate greater transparency to clinical users. In this paper, we conduct field interviews with radiologists to assess potential use-cases of conformal predictors. Using insights collected from these interviews, we devise two use-cases and empirically evaluate several conformal methods on a dermatology photography dataset for skin lesion classification. Additionally, we show how group conformal predictors are more adaptive to differences between patient skin tones for malignant skin lesions. We find our conformal predictors to be a promising and generally applicable approach to increasing clinical usability and trustworthiness -- hopefully facilitating better modes of collaboration between medical AI tools and their clinical users.  ( 2 min )
    Information Flow in Deep Neural Networks. (arXiv:2202.06749v2 [cs.LG] UPDATED)
    Although deep neural networks have been immensely successful, there is no comprehensive theoretical understanding of how they work or are structured. As a result, deep networks are often seen as black boxes with unclear interpretations and reliability. Understanding the performance of deep neural networks is one of the greatest scientific challenges. This work aims to apply principles and techniques from information theory to deep learning models to increase our theoretical understanding and design better algorithms. We first describe our information-theoretic approach to deep learning. Then, we propose using the Information Bottleneck (IB) theory to explain deep learning systems. The novel paradigm for analyzing networks sheds light on their layered structure, generalization abilities, and learning dynamics. We later discuss one of the most challenging problems of applying the IB to deep neural networks - estimating mutual information. Recent theoretical developments, such as the neural tangent kernel (NTK) framework, are used to investigate generalization signals. In our study, we obtained tractable computations of many information-theoretic quantities and their bounds for infinite ensembles of infinitely wide neural networks. With these derivations, we can determine how compression, generalization, and sample size pertain to the network and how they are related. At the end, we present the dual Information Bottleneck (dualIB). This new information-theoretic framework resolves some of the IB's shortcomings by merely switching terms in the distortion function. The dualIB can account for known data features and use them to make better predictions over unseen examples. An analytical framework reveals the underlying structure and optimal representations, and a variational framework using deep neural network optimization validates the results.  ( 2 min )
    Provably efficient machine learning for quantum many-body problems. (arXiv:2106.12627v3 [quant-ph] UPDATED)
    Classical machine learning (ML) provides a potentially powerful approach to solving challenging quantum many-body problems in physics and chemistry. However, the advantages of ML over more traditional methods have not been firmly established. In this work, we prove that classical ML algorithms can efficiently predict ground state properties of gapped Hamiltonians in finite spatial dimensions, after learning from data obtained by measuring other Hamiltonians in the same quantum phase of matter. In contrast, under widely accepted complexity theory assumptions, classical algorithms that do not learn from data cannot achieve the same guarantee. We also prove that classical ML algorithms can efficiently classify a wide range of quantum phases of matter. Our arguments are based on the concept of a classical shadow, a succinct classical description of a many-body quantum state that can be constructed in feasible quantum experiments and be used to predict many properties of the state. Extensive numerical experiments corroborate our theoretical results in a variety of scenarios, including Rydberg atom systems, 2D random Heisenberg models, symmetry-protected topological phases, and topologically ordered phases.
    Structured second-order methods via natural gradient descent. (arXiv:2107.10884v3 [stat.ML] UPDATED)
    In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach to design new algorithms in many settings such as gradient-free, adaptive-gradient, and second-order methods. Our structured methods not only enjoy a structural invariance but also admit a simple expression. Finally, we test the efficiency of our proposed methods on both deterministic non-convex problems and deep learning problems.
    Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems. (arXiv:2007.03481v4 [cs.LG] UPDATED)
    This paper presents an inverse reinforcement learning~(IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality wrt the observed actions. Our IRL algorithm identifies optimality and then constructs set valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.
    Deep Learning for Technical Document Classification. (arXiv:2106.14269v5 [cs.LG] UPDATED)
    In large technology companies, the requirements for managing and organizing technical documents created by engineers and managers have increased dramatically in recent years, which has led to a higher demand for more scalable, accurate, and automated document classification. Prior studies have only focused on processing text for classification, whereas technical documents often contain multimodal information. To leverage multimodal information for document classification to improve the model performance, this paper presents a novel multimodal deep learning architecture, TechDoc, which utilizes three types of information, including natural language texts and descriptive images within documents and the associations among the documents. The architecture synthesizes the convolutional neural network, recurrent neural network, and graph neural network through an integrated training process. We applied the architecture to a large multimodal technical document database and trained the model for classifying documents based on the hierarchical International Patent Classification system. Our results show that TechDoc presents a greater classification accuracy than the unimodal methods and other state-of-the-art benchmarks. The trained model can potentially be scaled to millions of real-world multimodal technical documents, which is useful for data and knowledge management in large technology companies and organizations.
    Subspace Regularizers for Few-Shot Class Incremental Learning. (arXiv:2110.07059v2 [cs.CV] UPDATED)
    Few-shot class incremental learning -- the problem of updating a trained classifier to discriminate among an expanded set of classes with limited labeled data -- is a key challenge for machine learning systems deployed in non-stationary environments. Existing approaches to the problem rely on complex model architectures and training procedures that are difficult to tune and re-use. In this paper, we present an extremely simple approach that enables the use of ordinary logistic regression classifiers for few-shot incremental learning. The key to this approach is a new family of subspace regularization schemes that encourage weight vectors for new classes to lie close to the subspace spanned by the weights of existing classes. When combined with pretrained convolutional feature extractors, logistic regression models trained with subspace regularization outperform specialized, state-of-the-art approaches to few-shot incremental image classification by up to 22% on the miniImageNet dataset. Because of its simplicity, subspace regularization can be straightforwardly extended to incorporate additional background information about the new classes (including class names and descriptions specified in natural language); these further improve accuracy by up to 2%. Our results show that simple geometric regularization of class representations offers an effective tool for continual learning.
    Node Feature Kernels Increase Graph Convolutional Network Robustness. (arXiv:2109.01785v3 [cs.LG] UPDATED)
    The robustness of the much-used Graph Convolutional Networks (GCNs) to perturbations of their input is becoming a topic of increasing importance. In this paper, the random GCN is introduced for which a random matrix theory analysis is possible. This analysis suggests that if the graph is sufficiently perturbed, or in the extreme case random, then the GCN fails to benefit from the node features. It is furthermore observed that enhancing the message passing step in GCNs by adding the node feature kernel to the adjacency matrix of the graph structure solves this problem. An empirical study of a GCN utilised for node classification on six real datasets further confirms the theoretical findings and demonstrates that perturbations of the graph structure can result in GCNs performing significantly worse than Multi-Layer Perceptrons run on the node features alone. In practice, adding a node feature kernel to the message passing of perturbed graphs results in a significant improvement of the GCN's performance, thereby rendering it more robust to graph perturbations. Our code is publicly available at:https://github.com/ChangminWu/RobustGCN.
    PURSUhInT: In Search of Informative Hint Points Based on Layer Clustering for Knowledge Distillation. (arXiv:2103.00053v2 [cs.LG] UPDATED)
    We propose a novel knowledge distillation methodology for compressing deep neural networks. One of the most efficient methods for knowledge distillation is hint distillation, where the student model is injected with information (hints) from several different layers of the teacher model. Although the selection of hint points can drastically alter the compression performance, conventional distillation approaches overlook this fact. Therefore, we propose a clustering based hint selection methodology, where the layers of teacher model are clustered with respect to several metrics and the cluster centers are used as the hint points. Our method is applicable for any student network, once it is applied on a chosen teacher network. The proposed approach is validated in CIFAR-100 and ImageNet datasets, using various teacher-student pairs and numerous hint distillation methods. Our results show that hint points selected by our algorithm results in superior compression performance with respect to state-of-the-art knowledge distillation algorithms on the same student models and datasets.
    Why KDAC? A general activation function for knowledge discovery. (arXiv:2111.13858v2 [cs.LG] UPDATED)
    Named entity recognition based on deep learning (DNER) can effectively mine expected knowledge from large-scale unstructured and semi-structured text, and has gradually become the paradigm of knowledge discovery. Currently, Tanh, ReLU and Sigmoid dominate DNER, however, these activation functions failed to treat gradient vanishing, no negative output or non-differentiable existence, which may impede DNER's exploration of knowledge caused by the omission and the incomplete representation of latent semantics. To surmount the non-negligible obstacle, we present a novel and general activation function termed KDAC. Detailly, KDAC is a thought that can aggregate and inherit the merits of Tanh and ReLU since they are widely leveraged in various knowledge domains. The positive region corresponds to an adaptive linear design encouraged by ReLU. The negative region considers the interaction between exponent and linearity to surmount the obstacle of gradient vanishing and no negative value. Crucially, the non-differentiable points are alerted and eliminated by a smooth approximation. We perform experiments based on BERT-BiLSTM-CNN-CRF model on six benchmark datasets containing different domain knowledge, such as Weibo, Clinical, E-commerce, Resume, HAZOP and People's daily. The experimental results show that KDAC is advanced and effective, and can provide more generalized activation to stimulate the performance of DNER. We hope that KDAC can be exploited as a promising alternative activation function in DNER to devote itself to the construction of knowledge.
    One-Step Abductive Multi-Target Learning with Diverse Noisy Samples: An Application to Tumour Segmentation for Breast Cancer. (arXiv:2110.10325v4 [cs.LG] UPDATED)
    One-step abductive multi-target learning (OSAMTL) is an approach proposed to handle complex noisy labels. However, OSAMTL is not suitable for the situation where diverse noisy samples (DNS) are provided for a learning task. In this paper, giving definition of DNS, we propose one-step abductive multi-target learning with DNS (OSAMTL-DNS) to expand the original OSAMTL to a wider range of tasks that handle complex noisy labels. Applying OSAMTL-DNS to tumour segmentation for breast cancer in medical histopathology whole slide image analysis, we show that OSAMTL-DNS is able to enable various state-of-the-art approaches for learning from noisy labels to achieve significantly more rational predictions.
    DiffCo: Auto-Differentiable Proxy Collision Detection with Multi-class Labels for Safety-Aware Trajectory Optimization. (arXiv:2102.07413v2 [cs.RO] UPDATED)
    The objective of trajectory optimization algorithms is to achieve an optimal collision-free path between a start and goal state. In real-world scenarios where environments can be complex and non-homogeneous, a robot needs to be able to gauge whether a state will be in collision with various objects in order to meet some safety metrics. The collision detector should be computationally efficient and, ideally, analytically differentiable to facilitate stable and rapid gradient descent during optimization. However, methods today lack an elegant approach to detect collision differentiably, relying rather on numerical gradients that can be unstable. We present DiffCo, the first, fully auto-differentiable, non-parametric model for collision detection. Its non-parametric behavior allows one to compute collision boundaries on-the-fly and update them, requiring no pre-training and allowing it to update continuously in dynamic environments. It provides robust gradients for trajectory optimization via backpropagation and is often 10-100x faster to compute than its geometric counterparts. DiffCo also extends trivially to modeling different object collision classes for semantically informed trajectory optimization.
    DeePN$^2$: A deep learning-based non-Newtonian hydrodynamic model. (arXiv:2112.14798v2 [physics.comp-ph] UPDATED)
    A long-standing problem in the modeling of non-Newtonian hydrodynamics of polymeric flows is the availability of reliable and interpretable hydrodynamic models that faithfully encode the underlying micro-scale polymer dynamics. The main complication arises from the long polymer relaxation time, the complex molecular structure and heterogeneous interaction. DeePN$^2$, a deep learning-based non-Newtonian hydrodynamic model, has been proposed and has shown some success in systematically passing the micro-scale structural mechanics information to the macro-scale hydrodynamics for suspensions with simple polymer conformation and bond potential. The model retains a multi-scaled nature by mapping the polymer configurations into a set of symmetry-preserving macro-scale features. The extended constitutive laws for these macro-scale features can be directly learned from the kinetics of their micro-scale counterparts. In this paper, we develop DeePN$^2$ using more complex micro-structural models. We show that DeePN$^2$ can faithfully capture the broadly overlooked viscoelastic differences arising from the specific molecular structural mechanics without human intervention.
    Z2P: Instant Visualization of Point Clouds. (arXiv:2105.14548v2 [cs.GR] UPDATED)
    We present a technique for visualizing point clouds using a neural network. Our technique allows for an instant preview of any point cloud, and bypasses the notoriously difficult surface reconstruction problem or the need to estimate oriented normals for splat-based rendering. We cast the preview problem as a conditional image-to-image translation task, and design a neural network that translates point depth-map directly into an image, where the point cloud is visualized as though a surface was reconstructed from it. Furthermore, the resulting appearance of the visualized point cloud can be, optionally, conditioned on simple control variables (e.g., color and light). We demonstrate that our technique instantly produces plausible images, and can, on-the-fly effectively handle noise, non-uniform sampling, and thin surfaces sheets.
    Quantization Backdoors to Deep Learning Commercial Frameworks. (arXiv:2108.09187v2 [cs.CR] UPDATED)
    Currently, there is a burgeoning demand for deploying deep learning (DL) models on ubiquitous edge Internet of Things (IoT) devices attributed to their low latency and high privacy preservation. However, DL models are often large in size and require large-scale computation, which prevents them from being placed directly onto IoT devices, where resources are constrained and 32-bit floating-point (float-32) operations are unavailable. Commercial framework (i.e., a set of toolkits) empowered model quantization is a pragmatic solution that enables DL deployment on mobile devices and embedded systems by effortlessly post-quantizing a large high-precision model (e.g., float-32) into a small low-precision model (e.g., int-8) while retaining the model inference accuracy. However, their usability might be threatened by security vulnerabilities. This work reveals that the standard quantization toolkits can be abused to activate a backdoor. We demonstrate that a full-precision backdoored model which does not have any backdoor effect in the presence of a trigger -- as the backdoor is dormant -- can be activated by the default i) TensorFlow-Lite (TFLite) quantization, the only product-ready quantization framework to date, and ii) the beta released PyTorch Mobile framework. When each of the float-32 models is converted into an int-8 format model through the standard TFLite or Pytorch Mobile framework's post-training quantization, the backdoor is activated in the quantized model, which shows a stable attack success rate close to 100% upon inputs with the trigger, while it behaves normally upon non-trigger inputs. This work highlights that a stealthy security threat occurs when an end user utilizes the on-device post-training model quantization frameworks, informing security researchers of cross-platform overhaul of DL models post quantization even if these models pass front-end backdoor inspections.
    Deep kernel machines: exact inference with representation learning in infinite Bayesian neural networks. (arXiv:2108.13097v3 [stat.ML] UPDATED)
    Deep neural networks (DNNs) with the flexibility to learn good top-layer representations have eclipsed shallow kernel methods without that flexibility. Here, we take inspiration from DNNs to develop the deep kernel machine. Optimizing the deep kernel machine objective is equivalent to exact Bayesian inference (or noisy gradient descent) in an infinitely wide Bayesian neural network or deep Gaussian process, which has been scaled carefully to retain representation learning. Our work thus has important implications for theoretical understanding of neural networks. In addition, we show that the deep kernel machine objective has more desirable properties and better performance than other choices of objective. Finally, we conjecture that the deep kernel machine objective is unimodal. We give a proof of unimodality for linear kernels, and a number of experiments in the nonlinear case in which all deep kernel machines initializations we tried converged to the same solution.
    Security Analysis of Camera-LiDAR Fusion Against Black-Box Attacks on Autonomous Vehicles. (arXiv:2106.07098v4 [cs.CR] UPDATED)
    To enable safe and reliable decision-making, autonomous vehicles (AVs) feed sensor data to perception algorithms to understand the environment. Sensor fusion with multi-frame tracking is becoming increasingly popular for detecting 3D objects. Thus, in this work, we perform an analysis of camera-LiDAR fusion, in the AV context, under LiDAR spoofing attacks. Recently, LiDAR-only perception was shown vulnerable to LiDAR spoofing attacks; however, we demonstrate these attacks are not capable of disrupting camera-LiDAR fusion. We then define a novel, context-aware attack: frustum attack, and show that out of 8 widely used perception algorithms - across 3 architectures of LiDAR-only and 3 architectures of camera-LiDAR fusion - all are significantly vulnerable to the frustum attack. In addition, we demonstrate that the frustum attack is stealthy to existing defenses against LiDAR spoofing as it preserves consistencies between camera and LiDAR semantics. Finally, we show that the frustum attack can be exercised consistently over time to form stealthy longitudinal attack sequences, compromising the tracking module and creating adverse outcomes on end-to-end AV control.
    Strategic Instrumental Variable Regression: Recovering Causal Relationships From Strategic Responses. (arXiv:2107.05762v2 [cs.LG] UPDATED)
    In settings where Machine Learning (ML) algorithms automate or inform consequential decisions about people, individual decision subjects are often incentivized to strategically modify their observable attributes to receive more favorable predictions. As a result, the distribution the assessment rule is trained on may differ from the one it operates on in deployment. While such distribution shifts, in general, can hinder accurate predictions, our work identifies a unique opportunity associated with shifts due to strategic responses: We show that we can use strategic responses effectively to recover causal relationships between the observable features and outcomes we wish to predict, even under the presence of unobserved confounding variables. Specifically, our work establishes a novel connection between strategic responses to ML models and instrumental variable (IV) regression by observing that the sequence of deployed models can be viewed as an instrument that affects agents' observable features but does not directly influence their outcomes. We show that our causal recovery method can be utilized to improve decision-making across several important criteria: individual fairness, agent outcomes, and predictive risk. In particular, we show that if decision subjects differ in their ability to modify non-causal attributes, any decision rule deviating from the causal coefficients can lead to(potentially unbounded) individual-level unfairness.
    On the Necessity of Auditable Algorithmic Definitions for Machine Unlearning. (arXiv:2110.11891v2 [cs.LG] UPDATED)
    Machine unlearning, i.e. having a model forget about some of its training data, has become increasingly more important as privacy legislation promotes variants of the right-to-be-forgotten. In the context of deep learning, approaches for machine unlearning are broadly categorized into two classes: exact unlearning methods, where an entity has formally removed the data point's impact on the model by retraining the model from scratch, and approximate unlearning, where an entity approximates the model parameters one would obtain by exact unlearning to save on compute costs. In this paper, we first show that the definition that underlies approximate unlearning, which seeks to prove the approximately unlearned model is close to an exactly retrained model, is incorrect because one can obtain the same model using different datasets. Thus one could unlearn without modifying the model at all. We then turn to exact unlearning approaches and ask how to verify their claims of unlearning. Our results show that even for a given training trajectory one cannot formally prove the absence of certain data points used during training. We thus conclude that unlearning is only well-defined at the algorithmic level, where an entity's only possible auditable claim to unlearning is that they used a particular algorithm designed to allow for external scrutiny during an audit.
    Transfer Learning in Quantum Parametric Classifiers: An Information-Theoretic Generalization Analysis. (arXiv:2201.06297v2 [quant-ph] UPDATED)
    A key step in quantum machine learning with classical inputs is the design of an embedding circuit mapping inputs to a quantum state. This paper studies a transfer learning setting in which classical-to-quantum embedding is carried out by an arbitrary parametric quantum circuit that is pre-trained based on data from a source task. At run time, the binary classifier is then optimized based on data from the target task of interest. Using an information-theoretic approach, we demonstrate that the average excess risk, or optimality gap, can be bounded in terms of two R\'enyi mutual information terms between classical input and quantum embedding under source and target tasks, as well as in terms of a measure of similarity between the source and target tasks related to the trace distance. The main theoretical results are validated on a simple binary classification example.
    Open-world Machine Learning: Applications, Challenges, and Opportunities. (arXiv:2105.13448v2 [cs.LG] UPDATED)
    Traditional machine learning mainly supervised learning, follows the assumptions of closed-world learning, i.e., for each testing class, a training class is available. However, such machine learning models fail to identify the classes which were not available during training time. These classes can be referred to as unseen classes. Whereas open-world machine learning (OWML) deals with unseen classes. In this paper, first, we present an overview of OWML with importance to the real-world context. Next, different dimensions of open-world machine learning are explored and discussed. The area of OWML gained the attention of the research community in the last decade only. We have searched through different online digital libraries and scrutinized the work done in the last decade. This paper presents a systematic review of various techniques for OWML. It also presents the research gaps, challenges, and future directions in open-world machine learning. This paper will help researchers understand the comprehensive developments of OWML and the likelihood of extending the research in suitable areas. It will also help to select applicable methodologies and datasets to explore this further.
    Bayesian Persuasion for Algorithmic Recourse. (arXiv:2112.06283v2 [cs.GT] UPDATED)
    When subjected to automated decision-making, decision subjects may strategically modify their observable features in ways they believe will maximize their chances of receiving a favorable decision. In many practical situations, the underlying assessment rule is deliberately kept secret to avoid gaming and maintain competitive advantage. The resulting opacity forces the decision subjects to rely on incomplete information when making strategic feature modifications. We capture such settings as a game of Bayesian persuasion, in which the decision maker offers a form of recourse to the decision subject by providing them with an action recommendation (or signal) to incentivize them to modify their features in desirable ways. We show that when using persuasion, both the decision maker and decision subject are never worse off in expectation, while the decision maker can be significantly better off. While the decision maker's problem of finding the optimal Bayesian incentive-compatible (BIC) signaling policy takes the form of optimization over infinitely-many variables, we show that this optimization can be cast as a linear program over finitely-many regions of the space of possible assessment rules. While this reformulation simplifies the problem dramatically, solving the linear program requires reasoning about exponentially-many variables, even under relatively simple settings. Motivated by this observation, we provide a polynomial-time approximation scheme that recovers a near-optimal signaling policy. Finally, our numerical simulations on semi-synthetic data empirically illustrate the benefits of using persuasion in the algorithmic recourse setting.
    Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization. (arXiv:2107.12580v2 [cs.LG] UPDATED)
    Central to the success of artificial neural networks is their ability to generalize. But does neural network generalization primarily rely on seeing highly similar training examples (memorization)? Or are neural networks capable of human-intelligence styled reasoning, and if so, to what extent? These remain fundamental open questions on artificial neural networks. In this paper, as steps towards answering these questions, we introduce a new benchmark, Pointer Value Retrieval (PVR) to study the limits of neural network reasoning. The PVR suite of tasks is based on reasoning about indirection, a hallmark of human intelligence, where a first stage (task) contains instructions for solving a second stage (task). In PVR, this is done by having one part of the task input act as a pointer, giving instructions on a different input location, which forms the output. We show this simple rule can be applied to create a diverse set of tasks across different input modalities and configurations. Importantly, this use of indirection enables systematically varying task difficulty through distribution shifts and increasing functional complexity. We conduct a detailed empirical study of different PVR tasks, discovering large variations in performance across dataset sizes, neural network architectures and task complexity. Further, by incorporating distribution shift and increased functional complexity, we develop nuanced tests for reasoning, revealing subtle failures and surprising successes, suggesting many promising directions of exploration on this benchmark.
    Equivariant Graph Attention Networks for Molecular Property Prediction. (arXiv:2202.09891v1 [cs.LG])
    Learning and reasoning about 3D molecular structures with varying size is an emerging and important challenge in machine learning and especially in drug discovery. Equivariant Graph Neural Networks (GNNs) can simultaneously leverage the geometric and relational detail of the problem domain and are known to learn expressive representations through the propagation of information between nodes leveraging higher-order representations to faithfully express the geometry of the data, such as directionality in their intermediate layers. In this work, we propose an equivariant GNN that operates with Cartesian coordinates to incorporate directionality and we implement a novel attention mechanism, acting as a content and spatial dependent filter when propagating information between nodes. We demonstrate the efficacy of our architecture on predicting quantum mechanical properties of small molecules and its benefit on problems that concern macromolecular structures such as protein complexes.  ( 2 min )
    Meta-learning an Intermediate Representation for Few-shot Block-wise Prediction of Landslide Susceptibility. (arXiv:2110.04922v2 [cs.LG] UPDATED)
    Predicting a landslide susceptibility map (LSM) is essential for risk recognition and disaster prevention. Despite the successful application of data-driven approaches for LSM prediction, most methods generally apply a single global model to predict the LSM for an entire target region. However, in large-scale areas with significant environmental change, various parts of the region hold different landslide-inducing environments, and therefore, should be predicted with respective models. This study first segmented target scenarios into blocks for individual analysis. Then, the critical problem is that in each block with limited samples, conducting training and testing a model is impossible for a satisfactory LSM prediction, especially in dangerous mountainous areas where landslide surveying is expensive. To solve the problem, we trained an intermediate representation by the meta-learning paradigm, which is superior for capturing information valuable for few-shot adaption from LSM tasks. We hypothesized that there are more general and vital concepts concerning landslide causes and are sensitive to variations in input features. Thus, we can quickly few-shot adapt the models from the intermediate representation for different blocks or even unseen tasks using very few exemplar samples. Experimental results on the two study areas demonstrated the validity of our block-wise analysis in large scenarios and revealed the top few-shot adaption performances of the proposed methods.
    Nonseparable Symplectic Neural Networks. (arXiv:2010.12636v3 [cs.LG] UPDATED)
    Predicting the behaviors of Hamiltonian systems has been drawing increasing attention in scientific machine learning. However, the vast majority of the literature was focused on predicting separable Hamiltonian systems with their kinematic and potential energy terms being explicitly decoupled while building data-driven paradigms to predict nonseparable Hamiltonian systems that are ubiquitous in fluid dynamics and quantum mechanics were rarely explored. The main computational challenge lies in the effective embedding of symplectic priors to describe the inherently coupled evolution of position and momentum, which typically exhibits intricate dynamics. To solve the problem, we propose a novel neural network architecture, Nonseparable Symplectic Neural Networks (NSSNNs), to uncover and embed the symplectic structure of a nonseparable Hamiltonian system from limited observation data. The enabling mechanics of our approach is an augmented symplectic time integrator to decouple the position and momentum energy terms and facilitate their evolution. We demonstrated the efficacy and versatility of our method by predicting a wide range of Hamiltonian systems, both separable and nonseparable, including chaotic vortical flows. We showed the unique computational merits of our approach to yield long-term, accurate, and robust predictions for large-scale Hamiltonian systems by rigorously enforcing symplectomorphism.  ( 2 min )
    A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes. (arXiv:2111.06784v2 [cs.LG] UPDATED)
    We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail.  ( 2 min )
    Towards Secure and Practical Machine Learning via Secret Sharing and Random Permutation. (arXiv:2108.07463v3 [cs.LG] UPDATED)
    With the increasing demands for privacy protection, privacy-preserving machine learning has been drawing much attention in both academia and industry. However, most existing methods have their limitations in practical applications. On the one hand, although most cryptographic methods are provable secure, they bring heavy computation and communication. On the other hand, the security of many relatively efficient private methods (e.g., federated learning and split learning) is being questioned, since they are non-provable secure. Inspired by previous work on privacy-preserving machine learning, we build a privacy-preserving machine learning framework by combining random permutation and arithmetic secret sharing via our compute-after-permutation technique. Since our method reduces the cost for element-wise function computation, it is more efficient than existing cryptographic methods. Moreover, by adopting distance correlation as a metric for privacy leakage, we demonstrate that our method is more secure than previous non-provable secure methods. Overall, our proposal achieves a good balance between security and efficiency. Experimental results show that our method not only is up to 6x faster and reduces up to 85% network traffic compared with state-of-the-art cryptographic methods, but also leaks less privacy during the training process compared with non-provable secure methods.  ( 2 min )
    Learning Representations Robust to Group Shifts and Adversarial Examples. (arXiv:2202.09446v1 [cs.LG])
    Despite the high performance achieved by deep neural networks on various tasks, extensive studies have demonstrated that small tweaks in the input could fail the model predictions. This issue of deep neural networks has led to a number of methods to improve model robustness, including adversarial training and distributionally robust optimization. Though both of these two methods are geared towards learning robust models, they have essentially different motivations: adversarial training attempts to train deep neural networks against perturbations, while distributional robust optimization aims at improving model performance on the most difficult "uncertain distributions". In this work, we propose an algorithm that combines adversarial training and group distribution robust optimization to improve robust representation learning. Experiments on three image benchmark datasets illustrate that the proposed method achieves superior results on robust metrics without sacrificing much of the standard measures.  ( 2 min )
    Signal Processing on Cell Complexes. (arXiv:2110.05614v2 [cs.LG] UPDATED)
    The processing of signals supported on non-Euclidean domains has attracted large interest recently. Thus far, such non-Euclidean domains have been abstracted primarily as graphs with signals supported on the nodes, though the processing of signals on more general structures such as simplicial complexes has also been considered. In this paper, we give an introduction to signal processing on (abstract) regular cell complexes, which provide a unifying framework encompassing graphs, simplicial complexes, cubical complexes and various meshes as special cases. We discuss how appropriate Hodge Laplacians for these cell complexes can be derived. These Hodge Laplacians enable the construction of convolutional filters, which can be employed in linear filtering and non-linear filtering via neural networks defined on cell complexes.
    Limitations of machine learning for building energy prediction: ASHRAE Great Energy Predictor III Kaggle competition error analysis. (arXiv:2106.13475v3 [cs.LG] UPDATED)
    Research is needed to explore the limitations and potential for improvement of machine learning for building energy prediction. With this aim, the ASHRAE Great Energy Predictor III (GEPIII) Kaggle competition was launched in 2019. This effort was the largest building energy meter machine learning competition of its kind, with 4,370 participants who submitted 39,403 predictions. The test data set included two years of hourly whole building readings from 2,380 meters in 1,448 buildings at 16 locations. This paper analyzes the various sources and types of residual model error from an aggregation of the competition's top 50 solutions. This analysis reveals the limitations for machine learning using the standard model inputs of historical meter, weather, and basic building metadata. The errors are classified according to timeframe, behavior, magnitude, and incidence in single buildings or across a campus. The results show machine learning models have errors within a range of acceptability (RMSLE_scaled = 0.3) occur in 4.8% of the test data and are unlikely to be accurately predicted.  ( 2 min )
    Using a one dimensional parabolic model of the full-batch loss to estimate learning rates during training. (arXiv:2108.13880v2 [cs.LG] UPDATED)
    A fundamental challenge in Deep Learning is to find optimal step sizes for stochastic gradient descent automatically. In traditional optimization, line searches are a commonly used method to determine step sizes. One problem in Deep Learning is that finding appropriate step sizes on the full-batch loss is unfeasibly expensive. Therefore, classical line search approaches, designed for losses without inherent noise, are usually not applicable. Recent empirical findings suggest, inter alia, that the full-batch loss behaves locally parabolically in the direction of noisy update step directions. Furthermore, the trend of the optimal update step size changes slowly. By exploiting these and more findings, this work introduces a line-search method that approximates the full-batch loss with a parabola estimated over several mini-batches. Learning rates are derived from such parabolas during training. In the experiments conducted, our approach is on par with SGD with Momentum tuned with a piece-wise constant learning rate schedule and often outperforms other line search approaches for Deep Learning across models, datasets, and batch sizes on validation and test accuracy. In addition, our approach is the first line search approach for Deep Learning that samples a larger batch size over multiple inferences to still work in low-batch scenarios.
    AdaGNN: Graph Neural Networks with Adaptive Frequency Response Filter. (arXiv:2104.12840v3 [cs.LG] UPDATED)
    Graph Neural Networks have recently become a prevailing paradigm for various high-impact graph analytical problems. Existing efforts can be mainly categorized as spectral-based and spatial-based methods. The major challenge for the former is to find an appropriate graph filter to distill discriminative information from input signals for learning. Recently, myriads of explorations are made to achieve better graph filters, e.g., Graph Convolutional Network (GCN), which leverages Chebyshev polynomial truncation to seek an approximation of graph filters and bridge these two families of methods. Nevertheless, it has been shown in recent studies that GCN and its variants are essentially employing fixed low-pass filters to perform information denoising. Thus their learning capability is rather limited and may over-smooth node representations at deeper layers. To tackle these problems, we develop a novel graph neural network framework AdaGNN with a well-designed adaptive frequency response filter. At its core, AdaGNN leverages a simple but elegant trainable filter that spans across multiple layers to capture the varying importance of different frequency components for node representation learning. The inherent differences among different feature channels are also well captured by the filter. As such, it empowers AdaGNN with stronger expressiveness and naturally alleviates the over-smoothing problem. We empirically validate the effectiveness of the proposed framework on various benchmark datasets. Theoretical analysis is also provided to show the superiority of the proposed AdaGNN. The open-source implementation of AdaGNN can be found here: https://github.com/yushundong/AdaGNN.
    Bad priors, their threats to Bayesian optimization and some remedies via prior learning. (arXiv:2109.08215v2 [cs.LG] UPDATED)
    The performance of deep neural networks can be highly sensitive to the choice of a variety of meta-parameters, such as optimizer parameters and model hyperparameters. Tuning these well, however, often requires extensive and costly experimentation. Bayesian optimization (BO) is a principled approach to solve such expensive hyperparameter tuning problems efficiently. Key to the performance of BO is specifying and refining a distribution over functions, which is used to reason about the optima of the underlying function being optimized. In this work, we consider the scenario where we have data from similar functions that allows us to specify a tighter distribution a priori. Specifically, we focus on the common but potentially costly task of tuning optimizer parameters for training neural networks. Building on the meta BO method from Wang et al. (2018), we develop practical improvements that (a) boost its performance by leveraging tuning results on multiple tasks without requiring observations for the same meta-parameter points across all tasks, and (b) retain its regret bound for a special case of our method. As a result, we provide a coherent BO solution for iterative optimization of continuous optimizer parameters. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.  ( 2 min )
    On the Efficiency of Subclass Knowledge Distillation in Classification Tasks. (arXiv:2109.05587v2 [cs.LG] UPDATED)
    This work introduces a novel knowledge distillation framework for classification tasks where information on existing subclasses is available and taken into consideration. In classification tasks with a small number of classes or binary detection (two classes) the amount of information transferred from the teacher to the student network is restricted, thus limiting the utility of knowledge distillation. Performance can be improved by leveraging information about possible subclasses within the available classes in the classification task. To that end, we propose the so-called Subclass Knowledge Distillation (SKD) framework, which is the process of transferring the subclasses' prediction knowledge from a large teacher model into a smaller student one. Through SKD, additional meaningful information which is not in the teacher's class logits but exists in subclasses (e.g., similarities inside classes) will be conveyed to the student and boost its performance. Mathematically, we measure how many extra information bits the teacher can provide for the student via SKD framework. The framework developed is evaluated in clinical application, namely colorectal polyp binary classification. In this application, clinician-provided annotations are used to define subclasses based on the annotation label's variability in a curriculum style of learning. A lightweight, low complexity student trained with the proposed framework achieves an F1-score of 85.05%, an improvement of 2.14% and 1.49% gain over the student that trains without and with conventional knowledge distillation, respectively. These results show that the extra subclasses' knowledge (i.e., 0.4656 label bits per training sample in our experiment) can provide more information about the teacher generalization, and therefore SKD can benefit from using more information to increase the student performance.  ( 2 min )
    Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions. (arXiv:2111.01235v2 [cs.LG] UPDATED)
    People affected by machine learning model decisions may benefit greatly from access to recourses, i.e. suggestions about what features they could change to receive a more favorable decision from the model. Current approaches try to optimize for the cost incurred by users when adopting a recourse, but they assume that all users share the same cost function. This is an unrealistic assumption because users might have diverse preferences about their willingness to change certain features. In this work, we introduce a new method for identifying recourse sets for users which does not assume that users' preferences are known in advance. We propose an objective function, Expected Minimum Cost (EMC), based on two key ideas: (1) when presenting a set of options to a user, there only needs to be one low-cost solution that the user could adopt; (2) when we do not know the user's true cost function, we can approximately optimize for user satisfaction by first sampling plausible cost functions from a distribution, then finding a recourse set that achieves a good cost for these samples. We optimize EMC with a novel discrete optimization algorithm, Cost Optimized Local Search (COLS), which is guaranteed to improve the recourse set quality over iterations. Experimental evaluation on popular real-world datasets with simulated users demonstrates that our method satisfies up to 25.89 percentage points more users compared to strong baseline methods, while, the human evaluation shows that our recourses are preferred more than twice as often as the strongest baseline recourses. Finally, using standard fairness metrics we show that our method can provide more fair solutions across demographic groups than baselines. We provide our source code at: https://github.com/prateeky2806/EMC-COLS-recourse  ( 2 min )
    Pyramid Convolutional RNN for MRI Image Reconstruction. (arXiv:1912.00543v6 [eess.IV] UPDATED)
    Fast and accurate MRI image reconstruction from undersampled data is crucial in clinical practice. Deep learning based reconstruction methods have shown promising advances in recent years. However, recovering fine details from undersampled data is still challenging. In this paper, we introduce a novel deep learning based method, Pyramid Convolutional RNN (PC-RNN), to reconstruct images from multiple scales. Based on the formulation of MRI reconstruction as an inverse problem, we design the PC-RNN model with three convolutional RNN (ConvRNN) modules to iteratively learn the features in multiple scales. Each ConvRNN module reconstructs images at different scales and the reconstructed images are combined by a final CNN module in a pyramid fashion. The multi-scale ConvRNN modules learn a coarse-to-fine image reconstruction. Unlike other common reconstruction methods for parallel imaging, PC-RNN does not employ coil sensitive maps for multi-coil data and directly model the multiple coils as multi-channel inputs. The coil compression technique is applied to standardize data with various coil numbers, leading to more efficient training. We evaluate our model on the fastMRI knee and brain datasets and the results show that the proposed model outperforms other methods and can recover more details. The proposed method is one of the winner solutions in the 2019 fastMRI competition.  ( 2 min )
    Non-asymptotic analysis and inference for an outlyingness induced winsorized mean. (arXiv:2105.02337v2 [stat.ML] UPDATED)
    Robust estimation of a mean vector, a topic regarded as obsolete in the traditional robust statistics community, has recently surged in machine learning literature in the last decade. The latest focus is on the sub-Gaussian performance and computability of the estimators in a non-asymptotic setting. Numerous traditional robust estimators are computationally intractable, which partly contributes to the renewal of the interest in the robust mean estimation. Robust centrality estimators, however, include the trimmed mean and the sample median. The latter has the best robustness but suffers a low-efficiency drawback. Trimmed mean and median of means, %as robust alternatives to the sample mean, and achieving sub-Gaussian performance have been proposed and studied in the literature. This article investigates the robustness of leading sub-Gaussian estimators of mean and reveals that none of them can resist greater than $25\%$ contamination in data and consequently introduces an outlyingness induced winsorized mean which has the best possible robustness (can resist up to $50\%$ contamination without breakdown) meanwhile achieving high efficiency. Furthermore, it has a sub-Gaussian performance for uncontaminated samples and a bounded estimation error for contaminated samples at a given confidence level in a finite sample setting. It can be computed in linear time.
    A Deep Generative Model for Reordering Adjacency Matrices. (arXiv:2110.04971v2 [cs.HC] UPDATED)
    Depending on the node ordering, an adjacency matrix can highlight distinct characteristics of a graph. Deriving a "proper" node ordering is thus a critical step in visualizing a graph as an adjacency matrix. Users often try multiple matrix reorderings using different methods until they find one that meets the analysis goal. However, this trial-and-error approach is laborious and disorganized, which is especially challenging for novices. This paper presents a technique that enables users to effortlessly find a matrix reordering they want. Specifically, we design a generative model that learns a latent space of diverse matrix reorderings of the given graph. We also construct an intuitive user interface from the learned latent space by creating a map of various matrix reorderings. We demonstrate our approach through quantitative and qualitative evaluations of the generated reorderings and learned latent spaces. The results show that our model is capable of learning a latent space of diverse matrix reorderings. Most existing research in this area generally focused on developing algorithms that can compute "better" matrix reorderings for particular circumstances. This paper introduces a fundamentally new approach to matrix visualization of a graph, where a machine learning model learns to generate diverse matrix reorderings of a graph.  ( 2 min )
    Deep Neural Networks and Tabular Data: A Survey. (arXiv:2110.01889v2 [cs.LG] UPDATED)
    Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous data sets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their application to modeling tabular data (inference or generation) remains highly challenging. This work provides an overview of state of the art deep learning methods for tabular data. We start by categorizing them into three groups: data transformations, specialized architectures, and regularization models. We then provide a comprehensive overview of the main approaches in each group. A discussion of deep learning approaches for generating tabular data is complemented by strategies for explaining deep models on tabular data. Our primary contribution is to address the main research streams and existing methodologies in this area, while highlighting relevant challenges and open research questions. We also provide an empirical comparison of traditional machine learning methods with deep learning approaches on real tabular data sets of different sizes and with different learning objectives. Our results indicate that algorithms based on gradient-boosted tree ensembles still outperform the deep learning models. To the best of our knowledge, this is the first in-depth look at deep learning approaches for tabular data. This work can serve as a valuable starting point and guide for researchers and practitioners interested in deep learning with tabular data.  ( 2 min )
    Modeling Interactions of Autonomous Vehicles and Pedestrians with Deep Multi-Agent Reinforcement Learning for Collision Avoidance. (arXiv:2109.15266v2 [cs.RO] UPDATED)
    Reliable pedestrian crash avoidance mitigation (PCAM) systems are crucial components of safe autonomous vehicles (AVs). The nature of the vehicle-pedestrian interaction where decisions of one agent directly affect the other agent's optimal behavior, and vice versa, is a challenging yet often neglected aspect of such systems. We address this issue by defining a Markov decision process (MDP) for a simulated driving scenario in which an AV driving along an urban street faces a pedestrian trying to cross an unmarked crosswalk. The AV's PCAM decision policy is learned through deep reinforcement learning (DRL). Since modeling pedestrians realistically is challenging, we compare two levels of intelligent pedestrian behavior. While the baseline model follows a predefined strategy, our advanced pedestrian model is defined as a second DRL agent. This model captures continuous learning and the uncertainty inherent in human behavior, making the vehicle-pedestrian interaction a deep multi-agent reinforcement learning (DMARL) problem. We benchmark the developed PCAM systems according to the agents' collision rate and the resulting traffic flow efficiency with a focus on the influence of observation uncertainty or noise on the decision-making of the agents. The results show that the AV is able to completely mitigate collisions under the majority of the investigated conditions, and that the DRL pedestrian model indeed learns a more intelligent crossing behavior.  ( 2 min )
    Mean-Field Multi-Agent Reinforcement Learning: A Decentralized Network Approach. (arXiv:2108.02731v2 [cs.LG] UPDATED)
    One of the challenges for multi-agent reinforcement learning (MARL) is designing efficient learning algorithms for a large system in which each agent has only limited or partial information of the entire system. While exciting progress has been made to analyze decentralized MARL with the network of agents for social networks and team video games, little is known theoretically for decentralized MARL with the network of states for modeling self-driving vehicles, ride-sharing, and data and traffic routing. This paper proposes a framework of localized training and decentralized execution to study MARL with network of states. Localized training means that agents only need to collect local information in their neighboring states during the training phase; decentralized execution implies that agents can execute afterwards the learned decentralized policies, which depend only on agents' current states. The theoretical analysis consists of three key components: the first is the reformulation of the MARL system as a networked Markov decision process with teams of agents, enabling updating the associated team Q-function in a localized fashion; the second is the Bellman equation for the value function and the appropriate Q-function on the probability measure space; and the third is the exponential decay property of the team Q-function, facilitating its approximation with efficient sample efficiency and controllable error. The theoretical analysis paves the way for a new algorithm LTDE-Neural-AC, where the actor-critic approach with over-parameterized neural networks is proposed. The convergence and sample complexity is established and shown to be scalable with respect to the sizes of both agents and states. To the best of our knowledge, this is the first neural network based MARL algorithm with network structure and provably convergence guarantee.  ( 2 min )
    Maximum n-times Coverage for Vaccine Design. (arXiv:2101.10902v4 [q-bio.QM] UPDATED)
    We introduce the maximum $n$-times coverage problem that selects $k$ overlays to maximize the summed coverage of weighted elements, where each element must be covered at least $n$ times. We also define the min-cost $n$-times coverage problem where the objective is to select the minimum set of overlays such that the sum of the weights of elements that are covered at least $n$ times is at least $\tau$. Maximum $n$-times coverage is a generalization of the multi-set multi-cover problem, is NP-complete, and is not submodular. We introduce two new practical solutions for $n$-times coverage based on integer linear programming and sequential greedy optimization. We show that maximum $n$-times coverage is a natural way to frame peptide vaccine design, and find that it produces a pan-strain COVID-19 vaccine design that is superior to 29 other published designs in predicted population coverage and the expected number of peptides displayed by each individual's HLA molecules.
    Dissecting graph measure performance for node clustering in LFR parameter space. (arXiv:2202.09827v1 [cs.SI])
    Graph measures that express closeness or distance between nodes can be employed for graph nodes clustering using metric clustering algorithms. There are numerous measures applicable to this task, and which one performs better is an open question. We study the performance of 25 graph measures on generated graphs with different parameters. While usually measure comparisons are limited to general measure ranking on a particular dataset, we aim to explore the performance of various measures depending on graph features. Using an LFR graph generator, we create a dataset of 11780 graphs covering the whole LFR parameter space. For each graph, we assess the quality of clustering with k-means algorithm for each considered measure. Based on this, we determine the best measure for each area of the parameter space. We find that the parameter space consists of distinct zones where one particular measure is the best. We analyze the geometry of the resulting zones and describe it with simple criteria. Given particular graph parameters, this allows us to recommend a particular measure to use for clustering.  ( 2 min )
    Towards Enabling Dynamic Convolution Neural Network Inference for Edge Intelligence. (arXiv:2202.09461v1 [cs.LG])
    Deep learning applications have achieved great success in numerous real-world applications. Deep learning models, especially Convolution Neural Networks (CNN) are often prototyped using FPGA because it offers high power efficiency and reconfigurability. The deployment of CNNs on FPGAs follows a design cycle that requires saving of model parameters in the on-chip memory during High-level synthesis (HLS). Recent advances in edge intelligence require CNN inference on edge network to increase throughput and reduce latency. To provide flexibility, dynamic parameter allocation to different mobile devices is required to implement either a predefined or defined on-the-fly CNN architecture. In this study, we present novel methodologies for dynamically streaming the model parameters at run-time to implement a traditional CNN architecture. We further propose a library-based approach to design scalable and dynamic distributed CNN inference on the fly leveraging partial-reconfiguration techniques, which is particularly suitable for resource-constrained edge devices. The proposed techniques are implemented on the Xilinx PYNQ-Z2 board to prove the concept by utilizing the LeNet-5 CNN model. The results show that the proposed methodologies are effective, with classification accuracy rates of 92%, 86%, and 94% respectively
    Recent Advances on Neural Network Pruning at Initialization. (arXiv:2103.06460v2 [cs.LG] UPDATED)
    Neural network pruning typically removes connections or neurons from a pretrained converged model; while a new pruning paradigm, pruning at initialization (PaI), attempts to prune a randomly initialized network. This paper offers the first survey concentrated on this emerging pruning fashion. We first introduce a generic formulation of neural network pruning, followed by the major classic pruning topics. Then, as the main body of this paper, a thorough and structured literature review of PaI methods is presented, consisting of two major tracks (sparse training and sparse selection). Finally, we summarize the surge of PaI compared to traditional pruning and discuss the open problems. Apart from the dedicated paper review, this paper also offers a code base for easy sanity-checking and benchmarking of different PaI methods.
    Predictive Coding: Towards a Future of Deep Learning beyond Backpropagation?. (arXiv:2202.09467v1 [cs.NE])
    The backpropagation of error algorithm used to train deep neural networks has been fundamental to the successes of deep learning. However, it requires sequential backward updates and non-local computations, which make it challenging to parallelize at scale and is unlike how learning works in the brain. Neuroscience-inspired learning algorithms, however, such as \emph{predictive coding}, which utilize local learning, have the potential to overcome these limitations and advance beyond current deep learning technologies. While predictive coding originated in theoretical neuroscience as a model of information processing in the cortex, recent work has developed the idea into a general-purpose algorithm able to train neural networks using only local computations. In this survey, we review works that have contributed to this perspective and demonstrate the close theoretical connections between predictive coding and backpropagation, as well as works that highlight the multiple advantages of using predictive coding models over backpropagation-trained neural networks. Specifically, we show the substantially greater flexibility of predictive coding networks against equivalent deep neural networks, which can function as classifiers, generators, and associative memories simultaneously, and can be defined on arbitrary graph topologies. Finally, we review direct benchmarks of predictive coding networks on machine learning classification tasks, as well as its close connections to control theory and applications in robotics.
    Imbalanced Malware Images Classification: a CNN based Approach. (arXiv:1708.08042v2 [cs.CV] UPDATED)
    Deep convolutional neural networks (CNNs) can be applied to malware binary detection via image classification. The performance, however, is degraded due to the imbalance of malware families (classes). To mitigate this issue, we propose a simple yet effective weighted softmax loss which can be employed as the final layer of deep CNNs. The original softmax loss is weighted, and the weight value can be determined according to class size. A scaling parameter is also included in computing the weight. Proper selection of this parameter is studied and an empirical option is suggested. The weighted loss aims at alleviating the impact of data imbalance in an end-to-end learning fashion. To validate the efficacy, we deploy the proposed weighted loss in a pre-trained deep CNN model and fine-tune it to achieve promising results on malware images classification. Extensive experiments also demonstrate that the new loss function can well fit other typical CNNs, yielding an improved classification performance.
    PETCI: A Parallel English Translation Dataset of Chinese Idioms. (arXiv:2202.09509v1 [cs.CL])
    Idioms are an important language phenomenon in Chinese, but idiom translation is notoriously hard. Current machine translation models perform poorly on idiom translation, while idioms are sparse in many translation datasets. We present PETCI, a parallel English translation dataset of Chinese idioms, aiming to improve idiom translation by both human and machine. The dataset is built by leveraging human and machine effort. Baseline generation models show unsatisfactory abilities to improve translation, but structure-aware classification models show good performance on distinguishing good translations. Furthermore, the size of PETCI can be easily increased without expertise. Overall, PETCI can be helpful to language learners and machine translation systems.
    Automated Attack Synthesis by Extracting Finite State Machines from Protocol Specification Documents. (arXiv:2202.09470v1 [cs.CR])
    Automated attack discovery techniques, such as attacker synthesis or model-based fuzzing, provide powerful ways to ensure network protocols operate correctly and securely. Such techniques, in general, require a formal representation of the protocol, often in the form of a finite state machine (FSM). Unfortunately, many protocols are only described in English prose, and implementing even a simple network protocol as an FSM is time-consuming and prone to subtle logical errors. Automatically extracting protocol FSMs from documentation can significantly contribute to increased use of these techniques and result in more robust and secure protocol implementations. In this work we focus on attacker synthesis as a representative technique for protocol security, and on RFCs as a representative format for protocol prose description. Unlike other works that rely on rule-based approaches or use off-the-shelf NLP tools directly, we suggest a data-driven approach for extracting FSMs from RFC documents. Specifically, we use a hybrid approach consisting of three key steps: (1) large-scale word-representation learning for technical language, (2) focused zero-shot learning for mapping protocol text to a protocol-independent information language, and (3) rule-based mapping from protocol-independent information to a specific protocol FSM. We show the generalizability of our FSM extraction by using the RFCs for six different protocols: BGPv4, DCCP, LTP, PPTP, SCTP and TCP. We demonstrate how automated extraction of an FSM from an RFC can be applied to the synthesis of attacks, with TCP and DCCP as case-studies. Our approach shows that it is possible to automate attacker synthesis against protocols by using textual specifications such as RFCs.
    Improving Attribution Methods by Learning Submodular Functions. (arXiv:2104.09073v3 [cs.LG] UPDATED)
    This work explores the novel idea of learning a submodular scoring function to improve the specificity/selectivity of existing feature attribution methods. Submodular scores are natural for attribution as they are known to accurately model the principle of diminishing returns. A new formulation for learning a deep submodular set function that is consistent with the real-valued attribution maps obtained by existing attribution methods is proposed. The final attribution value of a feature is then defined as the marginal gain in the induced submodular score of the feature in the context of other highly attributed features, thus decreasing the attribution of redundant yet discriminatory features. Experiments on multiple datasets illustrate that the proposed attribution method achieves higher specificity along with good discriminative power. The implementation of our method is publicly available at https://github.com/Piyushi-0/SEA-NN.
    Explicit Group Sparse Projection with Applications to Deep Learning and NMF. (arXiv:1912.03896v3 [cs.LG] UPDATED)
    We design a new sparse projection method for a set of vectors that guarantees a desired average sparsity level measured leveraging the popular Hoyer measure (an affine function of the ratio of the $\ell_1$ and $\ell_2$ norms). Existing approaches either project each vector individually or require the use of a regularization parameter which implicitly maps to the average $\ell_0$-measure of sparsity. Instead, in our approach we set the sparsity level for the whole set explicitly and simultaneously project a group of vectors with the sparsity level of each vector tuned automatically. We show that the computational complexity of our projection operator is linear in the size of the problem. Additionally, we propose a generalization of this projection by replacing the $\ell_1$ norm by its weighted version. We showcase the efficacy of our approach in both supervised and unsupervised learning tasks on image datasets including CIFAR10 and ImageNet. In deep neural network pruning, the sparse models produced by our method on ResNet50 have significantly higher accuracies at corresponding sparsity values compared to existing competitors. In nonnegative matrix factorization, our approach yields competitive reconstruction errors against state-of-the-art algorithms.
    Differentially Private Federated Learning via Inexact ADMM with Multiple Local Updates. (arXiv:2202.09409v1 [cs.LG])
    Differential privacy (DP) techniques can be applied to the federated learning model to statistically guarantee data privacy against inference attacks to communication among the learning agents. While ensuring strong data privacy, however, the DP techniques hinder achieving a greater learning performance. In this paper we develop a DP inexact alternating direction method of multipliers algorithm with multiple local updates for federated learning, where a sequence of convex subproblems is solved with the objective perturbation by random noises generated from a Laplace distribution. We show that our algorithm provides $\bar{\epsilon}$-DP for every iteration, where $\bar{\epsilon}$ is a privacy budget controlled by the user. We also present convergence analyses of the proposed algorithm. Using MNIST and FEMNIST datasets for the image classification, we demonstrate that our algorithm reduces the testing error by at most $31\%$ compared with the existing DP algorithm, while achieving the same level of data privacy. The numerical experiment also shows that our algorithm converges faster than the existing algorithm.
    Learning to Attack with Fewer Pixels: A Probabilistic Post-hoc Framework for Refining Arbitrary Dense Adversarial Attacks. (arXiv:2010.06131v2 [cs.CV] UPDATED)
    Deep neural network image classifiers are reported to be susceptible to adversarial evasion attacks, which use carefully crafted images created to mislead a classifier. Many adversarial attacks belong to the category of dense attacks, which generate adversarial examples by perturbing all the pixels of a natural image. To generate sparse perturbations, sparse attacks have been recently developed, which are usually independent attacks derived by modifying a dense attack's algorithm with sparsity regularisations, resulting in reduced attack efficiency. In this paper, we aim to tackle this task from a different perspective. We select the most effective perturbations from the ones generated from a dense attack, based on the fact we find that a considerable amount of the perturbations on an image generated by dense attacks may contribute little to attacking a classifier. Accordingly, we propose a probabilistic post-hoc framework that refines given dense attacks by significantly reducing the number of perturbed pixels but keeping their attack power, trained with mutual information maximisation. Given an arbitrary dense attack, the proposed model enjoys appealing compatibility for making its adversarial images more realistic and less detectable with fewer perturbations. Moreover, our framework performs adversarial attacks much faster than existing sparse attacks.
    Reciprocity in Machine Learning. (arXiv:2202.09480v1 [cs.LG])
    Machine learning is pervasive. It powers recommender systems such as Spotify, Instagram and YouTube, and health-care systems via models that predict sleep patterns, or the risk of disease. Individuals contribute data to these models and benefit from them. Are these contributions (outflows of influence) and benefits (inflows of influence) reciprocal? We propose measures of outflows, inflows and reciprocity building on previously proposed measures of training data influence. Our initial theoretical and empirical results indicate that under certain distributional assumptions, some classes of models are approximately reciprocal. We conclude with several open directions.
    Accurate Prediction and Uncertainty Estimation using Decoupled Prediction Interval Networks. (arXiv:2202.09664v1 [cs.LG])
    We propose a network architecture capable of reliably estimating uncertainty of regression based predictions without sacrificing accuracy. The current state-of-the-art uncertainty algorithms either fall short of achieving prediction accuracy comparable to the mean square error optimization or underestimate the variance of network predictions. We propose a decoupled network architecture that is capable of accomplishing both at the same time. We achieve this by breaking down the learning of prediction and prediction interval (PI) estimations into a two-stage training process. We use a custom loss function for learning a PI range around optimized mean estimation with a desired coverage of a proportion of the target labels within the PI range. We compare the proposed method with current state-of-the-art uncertainty quantification algorithms on synthetic datasets and UCI benchmarks, reducing the error in the predictions by 23 to 34% while maintaining 95% Prediction Interval Coverage Probability (PICP) for 7 out of 9 UCI benchmark datasets. We also examine the quality of our predictive uncertainty by evaluating on Active Learning and demonstrating 17 to 36% error reduction on UCI benchmarks.
    Selective Credit Assignment. (arXiv:2202.09699v1 [cs.LG])
    Efficient credit assignment is essential for reinforcement learning algorithms in both prediction and control settings. We describe a unified view on temporal-difference algorithms for selective credit assignment. These selective algorithms apply weightings to quantify the contribution of learning updates. We present insights into applying weightings to value-based learning and planning algorithms, and describe their role in mediating the backward credit distribution in prediction and control. Within this space, we identify some existing online learning algorithms that can assign credit selectively as special cases, as well as add new algorithms that assign credit backward in time counterfactually, allowing credit to be assigned off-trajectory and off-policy.
    Missing Data Infill with Automunge. (arXiv:2202.09484v1 [cs.LG])
    Missing data is a fundamental obstacle in the practice of data science. This paper surveys a few conventions for imputation as available in the Automunge open source python library platform for tabular data preprocessing, including "ML infill" in which auto ML models are trained for target features from partitioned extracts of a training set. A series of validation experiments were performed to benchmark imputation scenarios towards downstream model performance, in which it was found for the given benchmark sets that in many cases ML infill outperformed for both numeric and categoric target features, and was otherwise at minimum within noise distributions of the other imputation scenarios. Evidence also suggested supplementing ML infill with the addition of support columns with boolean integer markers signaling presence of infill was usually beneficial to downstream model performance. We consider these results sufficient to recommend defaulting to ML infill for tabular learning, and further recommend supplementing imputations with support columns signaling presence of infill, each as can be prepared with push-button operation in the Automunge library. Our contributions include an auto ML derived missing data imputation library for tabular learning in the python ecosystem, fully integrated into a preprocessing platform with an extensive library of feature transformations, with a novel production friendly implementation that bases imputation models on a designated train set for consistent basis towards additional data.
    Cross-Task Knowledge Distillation in Multi-Task Recommendation. (arXiv:2202.09852v1 [cs.IR])
    Multi-task learning has been widely used in real-world recommenders to predict different types of user feedback. Most prior works focus on designing network architectures for bottom layers as a means to share the knowledge about input features representations. However, since they adopt task-specific binary labels as supervised signals for training, the knowledge about how to accurately rank items is not fully shared across tasks. In this paper, we aim to enhance knowledge transfer for multi-task personalized recommendat optimization objectives. We propose a Cross-Task Knowledge Distillation (CrossDistil) framework in recommendation, which consists of three procedures. 1) Task Augmentation: We introduce auxiliary tasks with quadruplet loss functions to capture cross-task fine-grained ranking information, which could avoid task conflicts by preserving the cross-task consistent knowledge; 2) Knowledge Distillation: We design a knowledge distillation approach based on augmented tasks for sharing ranking knowledge, where tasks' predictions are aligned with a calibration process; 3) Model Training: Teacher and student models are trained in an end-to-end manner, with a novel error correction mechanism to speed up model training and improve knowledge quality. Comprehensive experiments on a public dataset and our production dataset are carried out to verify the effectiveness of CrossDistil as well as the necessity of its key components.
    Distributed Out-of-Memory NMF of Dense and Sparse Data on CPU/GPU Architectures with Automatic Model Selection for Exascale Data. (arXiv:2202.09518v1 [cs.DC])
    The need for efficient and scalable big-data analytics methods is more essential than ever due to the exploding size and complexity of globally emerging datasets. Nonnegative Matrix Factorization (NMF) is a well-known explainable unsupervised learning method for dimensionality reduction, latent feature extraction, blind source separation, data mining, and machine learning. In this paper, we introduce a new distributed out-of-memory NMF method, named pyDNMF-GPU, designed for modern heterogeneous CPU/GPU architectures that is capable of factoring exascale-sized dense and sparse matrices. Our method reduces the latency associated with local data transfer between the GPU and host using CUDA streams, and reduces the latency associated with collective communications (both intra-node and inter-node) via NCCL primitives. In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores, resulting in good scalability. We set new benchmarks for the size of the data being analyzed: in experiments, we measure up to 76x improvement on a single GPU over running on a single 18 core CPU and we show good weak scaling on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs, when decomposing a dense 340 Terabyte-size matrix and a 11 Exabyte-size sparse matrix of density 10e-6. Finally, we integrate our method with an automatic model selection method. With this integration, we introduce a new tool that is capable of analyzing, compressing, and discovering explainable latent structures in extremely large sparse and dense data.
    Post-hoc Calibration of Neural Networks by g-Layers. (arXiv:2006.12807v2 [cs.LG] UPDATED)
    Calibration of neural networks is a critical aspect to consider when incorporating machine learning models in real-world decision-making systems where the confidence of decisions are equally important as the decisions themselves. In recent years, there is a surge of research on neural network calibration and the majority of the works can be categorized into post-hoc calibration methods, defined as methods that learn an additional function to calibrate an already trained base network. In this work, we intend to understand the post-hoc calibration methods from a theoretical point of view. Especially, it is known that minimizing Negative Log-Likelihood (NLL) will lead to a calibrated network on the training set if the global optimum is attained (Bishop, 1994). Nevertheless, it is not clear learning an additional function in a post-hoc manner would lead to calibration in the theoretical sense. To this end, we prove that even though the base network ($f$) does not lead to the global optimum of NLL, by adding additional layers ($g$) and minimizing NLL by optimizing the parameters of $g$ one can obtain a calibrated network $g \circ f$. This not only provides a less stringent condition to obtain a calibrated network but also provides a theoretical justification of post-hoc calibration methods. Our experiments on various image classification benchmarks confirm the theory.
    Overparametrization improves robustness against adversarial attacks: A replication study. (arXiv:2202.09735v1 [cs.LG])
    Overparametrization has become a de facto standard in machine learning. Despite numerous efforts, our understanding of how and where overparametrization helps model accuracy and robustness is still limited. To this end, here we conduct an empirical investigation to systemically study and replicate previous findings in this area, in particular the study by Madry et al. Together with this study, our findings support the "universal law of robustness" recently proposed by Bubeck et al. We argue that while critical for robust perception, overparametrization may not be enough to achieve full robustness and smarter architectures e.g. the ones implemented by the human visual cortex) seem inevitable.
    Generative Adversarial Networks for Data Generation in Structural Health Monitoring. (arXiv:2112.08196v1 [cs.LG] CROSS LISTED)
    Structural Health Monitoring (SHM) has been continuously benefiting from the advancements in the field of data science. Various types of Artificial Intelligence (AI) methods have been utilized for the assessment and evaluation of civil structures. In AI, Machine Learning (ML) and Deep Learning (DL) algorithms require plenty of datasets to train; particularly, the more data DL models are trained with, the better output it yields. Yet, in SHM applications, collecting data from civil structures through sensors is expensive and obtaining useful data (damage associated data) is challenging. In this paper, 1-D Wasserstein loss Deep Convolutional Generative Adversarial Networks using Gradient Penalty (1-D WDCGAN-GP) is utilized to generate damage associated vibration datasets that are similar to the input. For the purpose of vibration-based damage diagnostics, a 1-D Deep Convolutional Neural Network (1-D DCNN) is built, trained, and tested on both real and generated datasets. The classification results from the 1-D DCNN on both datasets resulted to be very similar to each other. The presented work in this paper shows that for the cases of insufficient data in DL or ML-based damage diagnostics, 1-D WDCGAN-GP can successfully generate data for the model to be trained on. Keywords: 1-D Generative Adversarial Networks (GAN), Deep Convolutional Generative Adversarial Networks (DCGAN), Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP), 1-D Convolutional Neural Networks (CNN), Structural Health Monitoring (SHM), Structural Damage Diagnostics, Structural Damage Detection
    Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty. (arXiv:2110.08607v2 [cs.LG] UPDATED)
    In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework is especially targeted to the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where it is typically intractable to perform exact inference of latent variables. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space is often prone to lack physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modelling framework for unsupervised learning and identification for nonlinear dynamical systems. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential to generalization and prediction capabilities.
    Stochastic sparse adversarial attacks. (arXiv:2011.12423v4 [cs.LG] UPDATED)
    This paper introduces stochastic sparse adversarial attacks (SSAA), standing as simple, fast and purely noise-based targeted and untargeted attacks of neural network classifiers (NNC). SSAA offer new examples of sparse (or $L_0$) attacks for which only few methods have been proposed previously. These attacks are devised by exploiting a small-time expansion idea widely used for Markov processes. Experiments on small and large datasets (CIFAR-10 and ImageNet) illustrate several advantages of SSAA in comparison with the-state-of-the-art methods. For instance, in the untargeted case, our method called Voting Folded Gaussian Attack (VFGA) scales efficiently to ImageNet and achieves a significantly lower $L_0$ score than SparseFool (up to $\frac{2}{5}$) while being faster. Moreover, VFGA achieves better $L_0$ scores on ImageNet than Sparse-RS when both attacks are fully successful on a large number of samples.
    CAMul: Calibrated and Accurate Multi-view Time-Series Forecasting. (arXiv:2109.07438v2 [cs.LG] UPDATED)
    Probabilistic time-series forecasting enables reliable decision making across many domains. Most forecasting problems have diverse sources of data containing multiple modalities and structures. Leveraging information as well as uncertainty from these data sources for well-calibrated and accurate forecasts is an important challenging problem. Most previous work on multi-modal learning and forecasting simply aggregate intermediate representations from each data view by simple methods of summation or concatenation and do not explicitly model uncertainty for each data-view. We propose a general probabilistic multi-view forecasting framework CAMul, that can learn representations and uncertainty from diverse data sources. It integrates the knowledge and uncertainty from each data view in a dynamic context-specific manner assigning more importance to useful views to model a well-calibrated forecast distribution. We use CAMul for multiple domains with varied sources and modalities and show that CAMul outperforms other state-of-art probabilistic forecasting models by over 25\% in accuracy and calibration.
    Disentangling Autoencoders (DAE). (arXiv:2202.09926v1 [cs.LG])
    Noting the importance of factorizing or disentangling the latent space, we propose a novel framework for autoencoders based on the principles of symmetry transformations in group-theory, which is a non-probabilistic disentangling autoencoder model. To the best of our knowledge, this is the first model that is aiming to achieve disentanglement based on autoencoders without regularizers. The proposed model is compared to seven state-of-the-art generative models based on autoencoders and evaluated based on reconstruction loss and five metrics quantifying disentanglement losses. The experiment results show that the proposed model can have better disentanglement when variances of each features are different. We believe that this model leads a new field for disentanglement learning based on autoencoders without regularizers.
    An Overview and Experimental Study of Learning-based Optimization Algorithms for Vehicle Routing Problem. (arXiv:2107.07076v2 [cs.LG] UPDATED)
    Vehicle routing problem (VRP) is a typical discrete combinatorial optimization problem, and many models and algorithms have been proposed to solve the VRP and its variants. Although existing approaches have contributed a lot to the development of this field, these approaches either are limited in problem size or need manual intervening in choosing parameters. To solve these difficulties, many studies have considered the learning-based optimization (LBO) algorithms to solve the VRP. This paper reviews recent advances in this field and divides relevant approaches into end-to-end approaches and step-by-step approaches. We performed a statistical analysis of the reviewed articles from various aspects and designed three experiments to evaluate the performance of four representative LBO algorithms. Finally, we conclude the applicable types of problems for different LBO algorithms and suggest directions in which researchers can improve LBO algorithms.
    Attacks, Defenses, And Tools: A Framework To Facilitate Robust AI/ML Systems. (arXiv:2202.09465v1 [cs.CR])
    Software systems are increasingly relying on Artificial Intelligence (AI) and Machine Learning (ML) components. The emerging popularity of AI techniques in various application domains attracts malicious actors and adversaries. Therefore, the developers of AI-enabled software systems need to take into account various novel cyber-attacks and vulnerabilities that these systems may be susceptible to. This paper presents a framework to characterize attacks and weaknesses associated with AI-enabled systems and provide mitigation techniques and defense strategies. This framework aims to support software designers in taking proactive measures in developing AI-enabled software, understanding the attack surface of such systems, and developing products that are resilient to various emerging attacks associated with ML. The developed framework covers a broad spectrum of attacks, mitigation techniques, and defensive and offensive tools. In this paper, we demonstrate the framework architecture and its major components, describe their attributes, and discuss the long-term goals of this research.
    Deconstructing Distributions: A Pointwise Framework of Learning. (arXiv:2202.09931v1 [cs.LG])
    In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021)
    Enhancing Affective Representations of Music-Induced EEG through Multimodal Supervision and latent Domain Adaptation. (arXiv:2202.09750v1 [cs.SD])
    The study of Music Cognition and neural responses to music has been invaluable in understanding human emotions. Brain signals, though, manifest a highly complex structure that makes processing and retrieving meaningful features challenging, particularly of abstract constructs like affect. Moreover, the performance of learning models is undermined by the limited amount of available neuronal data and their severe inter-subject variability. In this paper we extract efficient, personalized affective representations from EEG signals during music listening. To this end, we employ music signals as a supervisory modality to EEG, aiming to project their semantic correspondence onto a common representation space. We utilize a bi-modal framework by combining an LSTM-based attention model to process EEG and a pre-trained model for music tagging, along with a reverse domain discriminator to align the distributions of the two modalities, further constraining the learning process with emotion tags. The resulting framework can be utilized for emotion recognition both directly, by performing supervised predictions from either modality, and indirectly, by providing relevant music samples to EEG input queries. The experimental findings show the potential of enhancing neuronal data through stimulus information for recognition purposes and yield insights into the distribution and temporal variance of music-induced affective features.
    MSPM: A Modularized and Scalable Multi-Agent Reinforcement Learning-based System for Financial Portfolio Management. (arXiv:2102.03502v4 [q-fin.PM] UPDATED)
    Financial portfolio management (PM) is one of the most applicable problems in reinforcement learning (RL) owing to its sequential decision-making nature. However, existing RL-based approaches rarely focus on scalability or reusability to adapt to the ever-changing markets. These approaches are rigid and unscalable to accommodate the varying number of assets of portfolios and increasing need for heterogeneous data. Also, RL agents in the existing systems are ad-hoc trained and hardly reusable for different portfolios. To confront the above problems, a modular design is desired for the systems to be compatible with reusable asset-dedicated agents. In this paper, we propose a multi-agent RL-based system for PM (MSPM). MSPM involves two types of asynchronously-updated modules: Evolving Agent Module (EAM) and Strategic Agent Module (SAM). An EAM is an information-generating module with a DQN agent, and it receives heterogeneous data and generates signal-comprised information for a particular asset. An SAM is a decision-making module with a PPO agent for portfolio optimization, and it connects to EAMs to reallocate the assets in a portfolio. Trained EAMs can be connected to any SAM at will. With its modularized architecture, the multi-step condensation of volatile market information, and the reusable design of EAM, MSPM simultaneously addresses the two challenges in RL-based PM: scalability and reusability. Experiments on 8-year U.S. stock market data prove the effectiveness of MSPM in profit accumulation by its outperformance over five baselines in terms of accumulated rate of return (ARR), daily rate of return, and Sortino ratio. MSPM improves ARR by at least 186.5% compared to CRP, a widely-used PM strategy. To validate the indispensability of EAM, we back-test and compare MSPMs on four portfolios. EAM-enabled MSPMs improve ARR by at least 1341.8% compared to EAM-disabled MSPMs.
    Clustering by the Probability Distributions from Extreme Value Theory. (arXiv:2202.09784v1 [cs.LG])
    Clustering is an essential task to unsupervised learning. It tries to automatically separate instances into coherent subsets. As one of the most well-known clustering algorithms, k-means assigns sample points at the boundary to a unique cluster, while it does not utilize the information of sample distribution or density. Comparably, it would potentially be more beneficial to consider the probability of each sample in a possible cluster. To this end, this paper generalizes k-means to model the distribution of clusters. Our novel clustering algorithm thus models the distributions of distances to centroids over a threshold by Generalized Pareto Distribution (GPD) in Extreme Value Theory (EVT). Notably, we propose the concept of centroid margin distance, use GPD to establish a probability model for each cluster, and perform a clustering algorithm based on the covering probability function derived from GPD. Such a GPD k-means thus enables the clustering algorithm from the probabilistic perspective. Correspondingly, we also introduce a naive baseline, dubbed as Generalized Extreme Value (GEV) k-means. GEV fits the distribution of the block maxima. In contrast, the GPD fits the distribution of distance to the centroid exceeding a sufficiently large threshold, leading to a more stable performance of GPD k-means. Notably, GEV k-means can also estimate cluster structure and thus perform reasonably well over classical k-means. Thus, extensive experiments on synthetic datasets and real datasets demonstrate that GPD k-means outperforms competitors. The github codes are released in https://github.com/sixiaozheng/EVT-K-means.
    Learning to Detect Slip with Barometric Tactile Sensors and a Temporal Convolutional Neural Network. (arXiv:2202.09549v1 [cs.RO])
    The ability to perceive object slip via tactile feedback enables humans to accomplish complex manipulation tasks including maintaining a stable grasp. Despite the utility of tactile information for many applications, tactile sensors have yet to be widely deployed in industrial robotics settings; part of the challenge lies in identifying slip and other events from the tactile data stream. In this paper, we present a learning-based method to detect slip using barometric tactile sensors. These sensors have many desirable properties including high durability and reliability, and are built from inexpensive, off-the-shelf components. We train a temporal convolution neural network to detect slip, achieving high detection accuracies while displaying robustness to the speed and direction of the slip motion. Further, we test our detector on two manipulation tasks involving a variety of common objects and demonstrate successful generalization to real-world scenarios not seen during training. We argue that barometric tactile sensing technology, combined with data-driven learning, is suitable for many manipulation tasks such as slip compensation.
    Dual Optimization for Kolmogorov Model Learning Using Enhanced Gradient Descent. (arXiv:2107.05011v2 [cs.LG] UPDATED)
    Data representation techniques have made a substantial contribution to advancing data processing and machine learning (ML). Improving predictive power was the focus of previous representation techniques, which unfortunately perform rather poorly on the interpretability in terms of extracting underlying insights of the data. Recently, the Kolmogorov model (KM) was studied, which is an interpretable and predictable representation approach to learning the underlying probabilistic structure of a set of random variables. The existing KM learning algorithms using semi-definite relaxation with randomization (SDRwR) or discrete monotonic optimization (DMO) have, however, limited utility to big data applications because they do not scale well computationally. In this paper, we propose a computationally scalable KM learning algorithm, based on the regularized dual optimization combined with enhanced gradient descent (GD) method. To make our method more scalable to large-dimensional problems, we propose two acceleration schemes, namely, the eigenvalue decomposition (EVD) elimination strategy and an approximate EVD algorithm. Furthermore, a thresholding technique by exploiting the error bound analysis and leveraging the normalized Minkowski $\ell_1$-norm, is provided for the selection of the number of iterations of the approximate EVD algorithm. When applied to big data applications, it is demonstrated that the proposed method can achieve compatible training/prediction performance with significantly reduced computational complexity; roughly two orders of magnitude improvement in terms of the time overhead, compared to the existing KM learning algorithms. Furthermore, it is shown that the accuracy of logical relation mining for interpretability by using the proposed KM learning algorithm exceeds $80\%$.  ( 2 min )
    Improving the Level of Autism Discrimination through GraphRNN Link Prediction. (arXiv:2202.09538v1 [cs.LG])
    Dataset is the key of deep learning in Autism disease research. However, due to the few quantity and heterogeneity of samples in current dataset, for example ABIDE (Autism Brain Imaging Data Exchange), the recognition research is not effective enough. Previous studies mostly focused on optimizing feature selection methods and data reinforcement to improve accuracy. This paper is based on the latter technique, which learns the edge distribution of real brain network through GraphRNN, and generates the synthetic data which has incentive effect on the discriminant model. The experimental results show that the combination of original and synthetic data greatly improves the discrimination of the neural network. For instance, the most significant effect is the 50-layer ResNet, and the best generation model is GraphRNN, which improves the accuracy by 32.51% compared with the model reference experiment without generation data reinforcement. Because the generated data comes from the learned edge connection distribution of Autism patients and typical controls functional connectivity, but it has better effect than the original data, which has constructive significance for further understanding of disease mechanism and development.  ( 2 min )
    Mining Robust Default Configurations for Resource-constrained AutoML. (arXiv:2202.09927v1 [cs.LG])
    Automatic machine learning (AutoML) is a key enabler of the mass deployment of the next generation of machine learning systems. A key desideratum for future ML systems is the automatic selection of models and hyperparameters. We present a novel method of selecting performant configurations for a given task by performing offline autoML and mining over a diverse set of tasks. By mining the training tasks, we can select a compact portfolio of configurations that perform well over a wide variety of tasks, as well as learn a strategy to select portfolio configurations for yet-unseen tasks. The algorithm runs in a zero-shot manner, that is without training any models online except the chosen one. In a compute- or time-constrained setting, this virtually instant selection is highly performant. Further, we show that our approach is effective for warm-starting existing autoML platforms. In both settings, we demonstrate an improvement on the state-of-the-art by testing over 62 classification and regression datasets. We also demonstrate the utility of recommending data-dependent default configurations that outperform widely used hand-crafted defaults.  ( 2 min )
    Zeroth-Order Methods for Convex-Concave Minmax Problems: Applications to Decision-Dependent Risk Minimization. (arXiv:2106.09082v2 [math.OC] UPDATED)
    Min-max optimization is emerging as a key framework for analyzing problems of robustness to strategically and adversarially generated data. We propose a random reshuffling-based gradient free Optimistic Gradient Descent-Ascent algorithm for solving convex-concave min-max problems with finite sum structure. We prove that the algorithm enjoys the same convergence rate as that of zeroth-order algorithms for convex minimization problems. We further specialize the algorithm to solve distributionally robust, decision-dependent learning problems, where gradient information is not readily available. Through illustrative simulations, we observe that our proposed approach learns models that are simultaneously robust against adversarial distribution shifts and strategic decisions from the data sources, and outperforms existing methods from the strategic classification literature.  ( 2 min )
    Signal Processing on Higher-Order Networks: Livin' on the Edge ... and Beyond. (arXiv:2101.05510v4 [cs.SI] UPDATED)
    In this tutorial, we provide a didactic treatment of the emerging topic of signal processing on higher-order networks. Drawing analogies from discrete and graph signal processing, we introduce the building blocks for processing data on simplicial complexes and hypergraphs, two common higher-order network abstractions that can incorporate polyadic relationships. We provide brief introductions to simplicial complexes and hypergraphs, with a special emphasis on the concepts needed for the processing of signals supported on these structures. Specifically, we discuss Fourier analysis, signal denoising, signal interpolation, node embeddings, and nonlinear processing through neural networks, using these two higher-order network models. In the context of simplicial complexes, we specifically focus on signal processing using the Hodge Laplacian matrix, a multi-relational operator that leverages the special structure of simplicial complexes and generalizes desirable properties of the Laplacian matrix in graph signal processing. For hypergraphs, we present both matrix and tensor representations, and discuss the trade-offs in adopting one or the other. We also highlight limitations and potential research avenues, both to inform practitioners and to motivate the contribution of new researchers to the area.  ( 2 min )
    Towards Scalable and Privacy-Preserving Deep Neural Network via Algorithmic-Cryptographic Co-design. (arXiv:2012.09364v2 [cs.LG] UPDATED)
    Deep Neural Networks (DNNs) have achieved remarkable progress in various real-world applications, especially when abundant training data are provided. However, data isolation has become a serious problem currently. Existing works build privacy preserving DNN models from either algorithmic perspective or cryptographic perspective. The former mainly splits the DNN computation graph between data holders or between data holders and server, which demonstrates good scalability but suffers from accuracy loss and potential privacy risks. In contrast, the latter leverages time-consuming cryptographic techniques, which has strong privacy guarantee but poor scalability. In this paper, we propose SPNN - a Scalable and Privacy-preserving deep Neural Network learning framework, from algorithmic-cryptographic co-perspective. From algorithmic perspective, we split the computation graph of DNN models into two parts, i.e., the private data related computations that are performed by data holders and the rest heavy computations that are delegated to a server with high computation ability. From cryptographic perspective, we propose using two types of cryptographic techniques, i.e., secret sharing and homomorphic encryption, for the isolated data holders to conduct private data related computations privately and cooperatively. Furthermore, we implement SPNN in a decentralized setting and introduce user-friendly APIs. Experimental results conducted on real-world datasets demonstrate the superiority of SPNN.  ( 2 min )
    Spatio-Temporal EEG Representation Learning on Riemannian Manifold and Euclidean Space. (arXiv:2008.08633v2 [cs.CV] UPDATED)
    We present a novel deep neural architecture for learning Electroencephalogram (EEG). To learn the spatial information, our model first obtains the Riemannian mean and distance from Spatial Covariance Matrices (SCMs) on the Riemannian manifold. We then project the spatial information onto the Euclidean space via tangent space learning. Following, two fully connected layers are used to learn the spatial information embeddings. Moreover, our proposed method learns the temporal information via differential entropy and logarithm power spectrum density features extracted from EEG signals in Euclidean space using a deep long short-term memory network with a soft attention mechanism. To combine the spatial and temporal information, we use an effective fusion strategy, which learns attention weights applied to embedding-specific features for decision making. We evaluate our proposed framework on four public datasets across three popular EEG-related tasks, notably emotion recognition, vigilance estimation, and motor imagery classification, containing various types of tasks such as binary classification, multi-class classification, and regression. Our proposed architecture approaches the state-of-the-art on one dataset (SEED) and outperforms other methods on the other three datasets (SEED-VIG, BCI-IV 2A, and BCI-IV 2B), setting new state-of-the-art values and showing the robustness of our framework in EEG representation learning. The source code of our paper is publicly available at https://github.com/guangyizhangbci/EEG_Riemannian.  ( 2 min )
    SRL-SOA: Self-Representation Learning with Sparse 1D-Operational Autoencoder for Hyperspectral Image Band Selection. (arXiv:2202.09918v1 [cs.CV])
    The band selection in the hyperspectral image (HSI) data processing is an important task considering its effect on the computational complexity and accuracy. In this work, we propose a novel framework for the band selection problem: Self-Representation Learning (SRL) with Sparse 1D-Operational Autoencoder (SOA). The proposed SLR-SOA approach introduces a novel autoencoder model, SOA, that is designed to learn a representation domain where the data are sparsely represented. Moreover, the network composes of 1D-operational layers with the non-linear neuron model. Hence, the learning capability of neurons (filters) is greatly improved with shallow architectures. Using compact architectures is especially crucial in autoencoders as they tend to overfit easily because of their identity mapping objective. Overall, we show that the proposed SRL-SOA band selection approach outperforms the competing methods over two HSI data including Indian Pines and Salinas-A considering the achieved land cover classification accuracies. The software implementation of the SRL-SOA approach is shared publicly at https://github.com/meteahishali/SRL-SOA.  ( 2 min )
    Opening the Black Box of Deep Neural Networks in Physical Layer Communication. (arXiv:2106.01124v3 [eess.SP] UPDATED)
    Deep Neural Network (DNN)-based physical layer techniques are attracting considerable interest due to their potential to enhance communication systems. However, most studies in the physical layer have tended to focus on the application of DNN models to wireless communication problems but not to theoretically understand how does a DNN work in a communication system. In this paper, we aim to quantitatively analyze why DNNs can achieve comparable performance in the physical layer comparing with traditional techniques and their cost in terms of computational complexity. We further investigate and also experimentally validate how information is flown in a DNN-based communication system under the information theoretic concepts.  ( 2 min )
    CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application. (arXiv:2008.09264v4 [eess.AS] UPDATED)
    This study presents a deep learning-based speech signal-processing mobile application known as CITISEN. The CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), allowing CITISEN to be used as a platform for utilizing and evaluating SE models and flexibly extend the models to address various noise environments and users. For SE, a pretrained SE model downloaded from the cloud server is used to effectively reduce noise components from instant or saved recordings provided by users. For encountering unseen noise or speaker environments, the MA function is applied to promote CITISEN. A few audio samples recording on a noisy environment are uploaded and used to adapt the pretrained SE model on the server. Finally, for BNC, CITISEN first removes the background noises through an SE model and then mixes the processed speech with new background noise. The novel BNC function can evaluate SE performance under specific conditions, cover people's tracks, and provide entertainment. The experimental results confirmed the effectiveness of SE, MA, and BNC functions. Compared with the noisy speech signals, the enhanced speech signals achieved about 6\% and 33\% of improvements, respectively, in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). With MA, the STOI and PESQ could be further improved by approximately 6\% and 11\%, respectively. Finally, the BNC experiment results indicated that the speech signals converted from noisy and silent backgrounds have a close scene identification accuracy and similar embeddings in an acoustic scene classification model. Therefore, the proposed BNC can effectively convert the background noise of a speech signal and be a data augmentation method when clean speech signals are unavailable.  ( 3 min )
    Double-descent curves in neural networks: a new perspective using Gaussian processes. (arXiv:2102.07238v4 [stat.ML] UPDATED)
    Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, then grows after reaching an optimal number of parameters which is less than the number of data points, but then descends again in the overparameterized regime. Here we use a neural network Gaussian process (NNGP) which maps exactly to a fully connected network (FCN) in the infinite-width limit, combined with techniques from random matrix theory, to calculate this generalisation behaviour. An advantage of our NNGP approach is that the analytical calculations are easier to interpret. We argue that the fact that the generalisation error of neural networks decreases in the overparameterized regime and has a finite theoretical value is explained by the convergence to their limiting Gaussian processes. Our analysis thus provides a mathematical explanation for a surprising phenomenon that could not explained by conventional statistical learning theory. However, understanding what drives these finite theoretical values to be the state-of-the-art generalisation performances in many applications remains an open question, for which we only provide new leads in this paper.  ( 2 min )
    MultiRocket: Multiple pooling operators and transformations for fast and effective time series classification. (arXiv:2102.00457v4 [cs.LG] UPDATED)
    We propose MultiRocket, a fast time series classification (TSC) algorithm that achieves state-of-the-art performance with a tiny fraction of the time and without the complex ensembling structure of many state-of-the-art methods. MultiRocket improves on MiniRocket, one of the fastest TSC algorithms to date, by adding multiple pooling operators and transformations to improve the diversity of the features generated. In addition to processing the raw input series, MultiRocket also applies first order differences to transform the original series. Convolutions are applied to both representations, and four pooling operators are applied to the convolution outputs. When benchmarked using the University of California Riverside TSC benchmark datasets, MultiRocket is significantly more accurate than MiniRocket, and competitive with the best ranked current method in terms of accuracy, HIVE-COTE 2.0, while being orders of magnitude faster.  ( 2 min )
    Routine Clustering of Mobile Sensor Data Facilitates Psychotic Relapse Prediction in Schizophrenia Patients. (arXiv:2106.11487v2 [cs.LG] UPDATED)
    We aim to develop clustering models to obtain behavioral representations from continuous multimodal mobile sensing data towards relapse prediction tasks. The identified clusters could represent different routine behavioral trends related to daily living of patients as well as atypical behavioral trends associated with impending relapse. We used the mobile sensing data obtained in the CrossCheck project for our analysis. Continuous data from six different mobile sensing-based modalities (e.g. ambient light, sound/conversation, acceleration etc.) obtained from a total of 63 schizophrenia patients, each monitored for up to a year, were used for the clustering models and relapse prediction evaluation. Two clustering models, Gaussian Mixture Model (GMM) and Partition Around Medoids (PAM), were used to obtain behavioral representations from the mobile sensing data. The features obtained from the clustering models were used to train and evaluate a personalized relapse prediction model using Balanced Random Forest. The personalization was done by identifying optimal features for a given patient based on a personalization subset consisting of other patients who are of similar age. The clusters identified using the GMM and PAM models were found to represent different behavioral patterns (such as clusters representing sedentary days, active but with low communications days, etc.). Significant changes near the relapse periods were seen in the obtained behavioral representation features from the clustering models. The clustering model based features, together with other features characterizing the mobile sensing data, resulted in an F2 score of 0.24 for the relapse prediction task in a leave-one-patient-out evaluation setting. This obtained F2 score is significantly higher than a random classification baseline with an average F2 score of 0.042.  ( 2 min )
    Adversarially Robust Kernel Smoothing. (arXiv:2102.08474v4 [cs.LG] UPDATED)
    We propose a scalable robust learning algorithm combining kernel smoothing and robust optimization. Our method is motivated by the convex analysis perspective of distributionally robust optimization based on probability metrics, such as the Wasserstein distance and the maximum mean discrepancy. We adapt the integral operator using supremal convolution in convex analysis to form a novel function majorant used for enforcing robustness. Our method is simple in form and applies to general loss functions and machine learning models. Exploiting a connection with optimal transport, we prove theoretical guarantees for certified robustness under distribution shift. Furthermore, we report experiments with general machine learning models, such as deep neural networks, to demonstrate competitive performance with the state-of-the-art certifiable robust learning algorithms based on the Wasserstein distance.  ( 2 min )
    Image Response Regression via Deep Neural Networks. (arXiv:2006.09911v3 [stat.ML] UPDATED)
    Delineating the associations between images and a vector of covariates is of central interest in medical imaging studies. To tackle this problem of image response regression, we propose a novel nonparametric approach in the framework of spatially varying coefficient models, where the spatially varying functions are estimated through deep neural networks. Compared to existing solutions, the proposed method explicitly accounts for spatial smoothness and subject heterogeneity, has straightforward interpretations, and is highly flexible and accurate in capturing complex association patterns. A key idea in our approach is to treat the image voxels as the effective samples, which not only alleviates the limited sample size issue that haunts the majority of medical imaging studies, but also leads to more robust and reproducible results. Focusing on a broad family of piecewise smooth functions, we establish the estimation and selection consistency, and derive the asymptotic error bounds. We demonstrate the efficacy of the method through intensive simulations, and further illustrate its advantages with analyses of two functional magnetic resonance imaging datasets.  ( 2 min )
    Generalizing Dynamic Mode Decomposition: Balancing Accuracy and Expressiveness in Koopman Approximations. (arXiv:2108.03712v2 [eess.SY] UPDATED)
    This paper tackles the data-driven approximation of unknown dynamical systems using Koopman-operator methods. Given a dictionary of functions, these methods approximate the projection of the action of the operator on the finite-dimensional subspace spanned by the dictionary. We propose the Tunable Symmetric Subspace Decomposition algorithm to refine the dictionary, balancing its expressiveness and accuracy. Expressiveness corresponds to the ability of the dictionary to describe the evolution of as many observables as possible and accuracy corresponds to the ability to correctly predict their evolution. Based on the observation that Koopman-invariant subspaces give rise to exact predictions, we reason that prediction accuracy is a function of the degree of invariance of the subspace generated by the dictionary and provide a data-driven measure to measure invariance proximity. The proposed algorithm iteratively prunes the initial functional space to identify a refined dictionary of functions that satisfies the desired level of accuracy while retaining as much of the original expressiveness as possible. We provide a full characterization of the algorithm properties and show that it generalizes both Extended Dynamic Mode Decomposition and Symmetric Subspace Decomposition. Simulations on planar systems show the effectiveness of the proposed methods in producing Koopman approximations of tunable accuracy that capture relevant information about the dynamical system.  ( 2 min )
    Gradient Estimation with Discrete Stein Operators. (arXiv:2202.09497v1 [stat.ML])
    Gradient estimation -- approximating the gradient of an expectation with respect to the parameters of a distribution -- is central to the solution of many machine learning problems. However, when the distribution is discrete, most common gradient estimators suffer from excessive variance. To improve the quality of gradient estimation, we introduce a variance reduction technique based on Stein operators for discrete distributions. We then use this technique to build flexible control variates for the REINFORCE leave-one-out estimator. Our control variates can be adapted online to minimize the variance and do not require extra evaluations of the target function. In benchmark generative modeling tasks such as training binary variational autoencoders, our gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.  ( 2 min )
    A Meta-embedding-based Ensemble Approach for ICD Coding Prediction. (arXiv:2102.13622v2 [cs.CL] UPDATED)
    International Classification of Diseases (ICD) are the de facto codes used globally for clinical coding. These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information. The problem of automatically assigning ICD codes has been approached in literature as a multilabel classification, using neural models on unstructured data. Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles. Furthermore, we exploit the geometric properties of the two sets of word vectors and combine them into a common dimensional space, using meta-embedding techniques. We demonstrate the efficacy of this approach for a multimodal setting, using unstructured and structured information. We empirically show that our approach improves the current state-of-the-art deep learning architectures and benefits ensemble models.  ( 2 min )
    Sqrt(d) Dimension Dependence of Langevin Monte Carlo. (arXiv:2109.03839v3 [cs.LG] UPDATED)
    This article considers the popular MCMC method of unadjusted Langevin Monte Carlo (LMC) and provides a non-asymptotic analysis of its sampling error in 2-Wasserstein distance. The proof is based on a refinement of mean-square analysis in Li et al. (2019), and this refined framework automates the analysis of a large class of sampling algorithms based on discretizations of contractive SDEs. Using this framework, we establish an $\tilde{O}(\sqrt{d}/\epsilon)$ mixing time bound for LMC, without warm start, under the common log-smooth and log-strongly-convex conditions, plus a growth condition on the 3rd-order derivative of the potential of target measures. This bound improves the best previously known $\tilde{O}(d/\epsilon)$ result and is optimal (in terms of order) in both dimension $d$ and accuracy tolerance $\epsilon$ for target measures satisfying the aforementioned assumptions. Our theoretical analysis is further validated by numerical experiments.  ( 2 min )
    Massive feature extraction for explaining and foretelling hydroclimatic time series forecastability at the global scale. (arXiv:2108.00846v2 [physics.ao-ph] UPDATED)
    Statistical analyses and descriptive characterizations are sometimes assumed to be offering information on time series forecastability. Despite the scientific interest suggested by such assumptions, the relationships between descriptive time series features (e.g., temporal dependence, entropy, seasonality, trend and linearity features) and actual time series forecastability (quantified by issuing and assessing forecasts for the past) are scarcely studied and quantified in the literature. In this work, we aim to fill in this gap by investigating such relationships, and the way that they can be exploited for understanding hydroclimatic forecastability and its patterns. To this end, we follow a systematic framework bringing together a variety of -- mostly new for hydrology -- concepts and methods, including 57 descriptive features and nine seasonal time series forecasting methods (i.e., one simple, five exponential smoothing, two state space and one automated autoregressive fractionally integrated moving average methods). We apply this framework to three global datasets originating from the larger Global Historical Climatology Network (GHCN) and Global Streamflow Indices and Metadata (GSIM) archives. As these datasets comprise over 13 000 monthly temperature, precipitation and river flow time series from several continents and hydroclimatic regimes, they allow us to provide trustable characterizations and interpretations of 12-month ahead hydroclimatic forecastability at the global scale...  ( 2 min )
    ReconResNet: Regularised Residual Learning for MR Image Reconstruction of Undersampled Cartesian and Radial Data. (arXiv:2103.09203v2 [eess.IV] UPDATED)
    MRI is an inherently slow process, which leads to long scan time for high-resolution imaging. The speed of acquisition can be increased by ignoring parts of the data (undersampling). Consequently, this leads to the degradation of image quality, such as loss of resolution or introduction of image artefacts. This work aims to reconstruct highly undersampled Cartesian or radial MR acquisitions, with better resolution and with less to no artefact compared to conventional techniques like compressed sensing. In recent times, deep learning has emerged as a very important area of research and has shown immense potential in solving inverse problems, e.g. MR image reconstruction. In this paper, a deep learning based MR image reconstruction framework is proposed, which includes a modified regularised version of ResNet as the network backbone to remove artefacts from the undersampled image, followed by data consistency steps that fusions the network output with the data already available from undersampled k-space in order to further improve reconstruction quality. The performance of this framework for various undersampling patterns has also been tested, and it has been observed that the framework is robust to deal with various sampling patterns, even when mixed together while training, and results in very high quality reconstruction, in terms of high SSIM (highest being 0.990$\pm$0.006 for acceleration factor of 3.5), while being compared with the fully sampled reconstruction. It has been shown that the proposed framework can successfully reconstruct even for an acceleration factor of 20 for Cartesian (0.968$\pm$0.005) and 17 for radially (0.962$\pm$0.012) sampled data. Furthermore, it has been shown that the framework preserves brain pathology during reconstruction while being trained on healthy subjects.  ( 3 min )
    Adversarial Examples in Constrained Domains. (arXiv:2011.01183v2 [cs.CR] UPDATED)
    Machine learning algorithms have been shown to be vulnerable to adversarial manipulation through systematic modification of inputs (e.g., adversarial examples) in domains such as image recognition. Under the default threat model, the adversary exploits the unconstrained nature of images; each feature (pixel) is fully under control of the adversary. However, it is not clear how these attacks translate to constrained domains that limit which and how features can be modified by the adversary (e.g., network intrusion detection). In this paper, we explore whether constrained domains are less vulnerable than unconstrained domains to adversarial example generation algorithms. We create an algorithm for generating adversarial sketches: targeted universal perturbation vectors which encode feature saliency within the envelope of domain constraints. To assess how these algorithms perform, we evaluate them in constrained (e.g., network intrusion detection) and unconstrained (e.g., image recognition) domains. The results demonstrate that our approaches generate misclassification rates in constrained domains that were comparable to those of unconstrained domains (greater than 95%). Our investigation shows that the narrow attack surface exposed by constrained domains is still sufficiently large to craft successful adversarial examples; and thus, constraints do not appear to make a domain robust. Indeed, with as little as five randomly selected features, one can still generate adversarial examples.  ( 2 min )
    Finite Sample Analysis of Two-Time-Scale Natural Actor-Critic Algorithm. (arXiv:2101.10506v2 [cs.LG] UPDATED)
    Actor-critic style two-time-scale algorithms are one of the most popular methods in reinforcement learning, and have seen great empirical success. However, their performance is not completely understood theoretically. In this paper, we characterize the \emph{global} convergence of an online natural actor-critic algorithm in the tabular setting using a single trajectory of samples. Our analysis applies to very general settings, as we only assume ergodicity of the underlying Markov decision process. In order to ensure enough exploration, we employ an $\epsilon$-greedy sampling of the trajectory. For a fixed and small enough exploration parameter $\epsilon$, we show that the two-time-scale natural actor-critic algorithm has a rate of convergence of $\tilde{\mathcal{O}}(1/T^{1/4})$, where $T$ is the number of samples, and this leads to a sample complexity of $\Tilde{\mathcal{O}}(1/\delta^{8})$ samples to find a policy that is within an error of $\delta$ from the \emph{global optimum}. Moreover, by carefully decreasing the exploration parameter $\epsilon$ as the iterations proceed, we present an improved sample complexity of $\Tilde{\mathcal{O}}(1/\delta^{6})$ for convergence to the global optimum.  ( 2 min )
    The four-fifths rule is not disparate impact: a woeful tale of epistemic trespassing in algorithmic fairness. (arXiv:2202.09519v1 [cs.CY])
    Computer scientists are trained to create abstractions that simplify and generalize. However, a premature abstraction that omits crucial contextual details creates the risk of epistemic trespassing, by falsely asserting its relevance into other contexts. We study how the field of responsible AI has created an imperfect synecdoche by abstracting the four-fifths rule (a.k.a. the 4/5 rule or 80% rule), a single part of disparate impact discrimination law, into the disparate impact metric. This metric incorrectly introduces a new deontic nuance and new potentials for ethical harms that were absent in the original 4/5 rule. We also survey how the field has amplified the potential for harm in codifying the 4/5 rule into popular AI fairness software toolkits. The harmful erasure of legal nuances is a wake-up call for computer scientists to self-critically re-evaluate the abstractions they create and use, particularly in the interdisciplinary field of AI ethics.  ( 2 min )
    Video Reenactment as Inductive Bias for Content-Motion Disentanglement. (arXiv:2102.00324v3 [cs.CV] UPDATED)
    Independent components within low-dimensional representations are essential inputs in several downstream tasks, and provide explanations over the observed data. Video-based disentangled factors of variation provide low-dimensional representations that can be identified and used to feed task-specific models. We introduce MTC-VAE, a self-supervised motion-transfer VAE model to disentangle motion and content from videos. Unlike previous work on video content-motion disentanglement, we adopt a chunk-wise modeling approach and take advantage of the motion information contained in spatiotemporal neighborhoods. Our model yields independent per-chunk representations that preserve temporal consistency. Hence, we reconstruct whole videos in a single forward-pass. We extend the ELBO's log-likelihood term and include a Blind Reenactment Loss as an inductive bias to leverage motion disentanglement, under the assumption that swapping motion features yields reenactment between two videos. We evaluate our model with recently-proposed disentanglement metrics and show that it outperforms a variety of methods for video motion-content disentanglement. Experiments on video reenactment show the effectiveness of our disentanglement in the input space where our model outperforms the baselines in reconstruction quality and motion alignment.  ( 2 min )
    Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization. (arXiv:2103.01499v2 [cs.LG] UPDATED)
    Batch Normalization (BN) is a commonly used technique to accelerate and stabilize training of deep neural networks. Despite its empirical success, a full theoretical understanding of BN is yet to be developed. In this work, we analyze BN through the lens of convex optimization. We introduce an analytic framework based on convex duality to obtain exact convex representations of weight-decay regularized ReLU networks with BN, which can be trained in polynomial-time. Our analyses also show that optimal layer weights can be obtained as simple closed-form formulas in the high-dimensional and/or overparameterized regimes. Furthermore, we find that Gradient Descent provides an algorithmic bias effect on the standard non-convex BN network, and we design an approach to explicitly encode this implicit regularization into the convex objective. Experiments with CIFAR image classification highlight the effectiveness of this explicit regularization for mimicking and substantially improving the performance of standard BN networks.  ( 2 min )
    Suitability of Different Metric Choices for Concept Drift Detection. (arXiv:2202.09486v1 [cs.LG])
    The notion of concept drift refers to the phenomenon that the distribution, which is underlying the observed data, changes over time; as a consequence machine learning models may become inaccurate and need adjustment. Many unsupervised approaches for drift detection rely on measuring the discrepancy between the sample distributions of two time windows. This may be done directly, after some preprocessing (feature extraction, embedding into a latent space, etc.), or with respect to inferred features (mean, variance, conditional probabilities etc.). Most drift detection methods can be distinguished in what metric they use, how this metric is estimated, and how the decision threshold is found. In this paper, we analyze structural properties of the drift induced signals in the context of different metrics. We compare different types of estimators and metrics theoretically and empirically and investigate the relevance of the single metric components. In addition, we propose new choices and demonstrate their suitability in several experiments.  ( 2 min )
    On the Complexity of Approximating Multimarginal Optimal Transport. (arXiv:1910.00152v3 [stat.ML] UPDATED)
    We study the complexity of approximating the multimarginal optimal transport (MOT) distance, a generalization of the classical optimal transport distance, considered here between $m$ discrete probability distributions supported each on $n$ support points. First, we show that the standard linear programming (LP) representation of the MOT problem is not a minimum-cost flow problem when $m \geq 3$. This negative result implies that some combinatorial algorithms, e.g., network simplex method, are not suitable for approximating the MOT problem, while the worst-case complexity bound for the deterministic interior-point algorithm remains a quantity of $\tilde{O}(n^{3m})$. We then propose two simple and \textit{deterministic} algorithms for approximating the MOT problem. The first algorithm, which we refer to as \textit{multimarginal Sinkhorn} algorithm, is a provably efficient multimarginal generalization of the Sinkhorn algorithm. We show that it achieves a complexity bound of $\tilde{O}(m^3n^m\varepsilon^{-2})$ for a tolerance $\varepsilon \in (0, 1)$. This provides a first \textit{near-linear time} complexity bound guarantee for approximating the MOT problem and matches the best known complexity bound for the Sinkhorn algorithm in the classical OT setting when $m = 2$. The second algorithm, which we refer to as \textit{accelerated multimarginal Sinkhorn} algorithm, achieves the acceleration by incorporating an estimate sequence and the complexity bound is $\tilde{O}(m^3n^{m+1/3}\varepsilon^{-4/3})$. This bound is better than that of the first algorithm in terms of $1/\varepsilon$, and accelerated alternating minimization algorithm~\citep{Tupitsa-2020-Multimarginal} in terms of $n$. Finally, we compare our new algorithms with the commercial LP solver \textsc{Gurobi}. Preliminary results on synthetic data and real images demonstrate the effectiveness and efficiency of our algorithms.  ( 2 min )
    Natural Language Understanding for Argumentative Dialogue Systems in the Opinion Building Domain. (arXiv:2103.02691v2 [cs.CL] UPDATED)
    This paper introduces a natural language understanding (NLU) framework for argumentative dialogue systems in the information-seeking and opinion building domain. The proposed framework consists of two sub-models, namely intent classifier and argument similarity. Intent classifier model stacks BiLSTM with attention mechanism on top of the pre-trained BERT model and fine-tune the model for recognizing the user intent, whereas the argument similarity model employs BERT+BiLSTM for identifying system arguments the user refers to in his or her natural language utterances. Our model is evaluated in an argumentative dialogue system that engages the user to inform him-/herself about a controversial topic by exploring pro and con arguments and build his/her opinion towards the topic. In order to evaluate the proposed approach, we collect user utterances for the interaction with the respective system labeling intent and referenced argument in an extensive online study. The data collection includes multiple topics and two different user types (native English speakers from the UK and non-native English speakers from China). Additionally, we evaluate the proposed intent classifier and argument similarity models separately on the publicly available Banking77 and STS benchmark datasets. The evaluation indicates a clear advantage of the utilized techniques over baseline approaches on several datasets, as well as the robustness of the proposed approach against new topics and different language proficiency as well as the cultural background of the user. Furthermore, results show that our intent classifier model outperforms DIET, DistillBERT, and BERT fine-tuned models in few-shot setups (i.e., with 10, 20, or 30 labeled examples per intent) and full data setup.  ( 2 min )
    Efficacy of regularized multi-task learning based on SVM models. (arXiv:1805.12507v2 [stat.ML] UPDATED)
    This paper investigates the efficacy of a regularized multi-task learning (MTL) framework based on SVM (M-SVM) to answer whether MTL always provides reliable results and how MTL outperforms independent learning. We first find that M-SVM is Bayes risk consistent in the limit of large sample size. This implies that despite the task dissimilarities, M-SVM always produces a reliable decision rule for each task in terms of misclassification error when the data size is large enough. Furthermore, we find that the task-interaction vanishes as the data size goes to infinity, and the convergence rates of M-SVM and its single-task counterpart have the same upper bound. The former suggests that M-SVM cannot improve the limit classifier's performance; based on the latter, we conjecture that the optimal convergence rate is not improved when the task number is fixed. As a novel insight of MTL, our theoretical and experimental results achieved an excellent agreement that the benefit of the MTL methods lies in the improvement of the pre-convergence-rate factor (PCR, to be denoted in Section III) rather than the convergence rate. Moreover, this improvement of PCR factors is more significant when the data size is small.  ( 2 min )
    Alternative design of DeepPDNet in the context of image restoration. (arXiv:2202.09810v1 [eess.IV])
    This work designs an image restoration deep network relying on unfolded Chambolle-Pock primal-dual iterations. Each layer of our network is built from Chambolle-Pock iterations when specified for minimizing a sum of a $\ell_2$-norm data-term and an analysis sparse prior. The parameters of our network are the step-sizes of the Chambolle-Pock scheme and the linear operator involved in sparsity-based penalization, including implicitly the regularization parameter. A backpropagation procedure is fully described. Preliminary experiments illustrate the good behavior of such a deep primal-dual network in the context of image restoration on BSD68 database.  ( 2 min )
    Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression. (arXiv:2202.09889v1 [stat.ML])
    We examine the necessity of interpolation in overparameterized models, that is, when achieving optimal predictive risk in machine learning problems requires (nearly) interpolating the training data. In particular, we consider simple overparameterized linear regression $y = X \theta + w$ with random design $X \in \mathbb{R}^{n \times d}$ under the proportional asymptotics $d/n \to \gamma \in (1, \infty)$. We precisely characterize how prediction (test) error necessarily scales with training error in this setting. An implication of this characterization is that as the label noise variance $\sigma^2 \to 0$, any estimator that incurs at least $\mathsf{c}\sigma^4$ training error for some constant $\mathsf{c}$ is necessarily suboptimal and will suffer growth in excess prediction error at least linear in the training error. Thus, optimal performance requires fitting training data to substantially higher accuracy than the inherent noise floor of the problem.  ( 2 min )
    Personalized Federated Learning with Exact Stochastic Gradient Descent. (arXiv:2202.09848v1 [cs.LG])
    In Federated Learning (FL), datasets across clients tend to be heterogeneous or personalized, and this poses challenges to the convergence of standard FL schemes that do not account for personalization. To address this, we present a new approach for personalized FL that achieves exact stochastic gradient descent (SGD) minimization. We start from the FedPer (Arivazhagan et al., 2019) neural network (NN) architecture for personalization, whereby the NN has two types of layers: the first ones are the common layers across clients, while the few final ones are client-specific and are needed for personalization. We propose a novel SGD-type scheme where, at each optimization round, randomly selected clients perform gradient-descent updates over their client-specific weights towards optimizing the loss function on their own datasets, without updating the common weights. At the final update, each client computes the joint gradient over both client-specific and common weights and returns the gradient of common parameters to the server. This allows to perform an exact and unbiased SGD step over the full set of parameters in a distributed manner, i.e. the updates of the personalized parameters are performed by the clients and those of the common ones by the server. Our method is superior to FedAvg and FedPer baselines in multi-class classification benchmarks such as Omniglot, CIFAR-10, MNIST, Fashion-MNIST, and EMNIST and has much lower computational complexity per round.  ( 2 min )
    $\mathcal{Y}$-Tuning: An Efficient Tuning Paradigm for Large-Scale Pre-Trained Models via Label Representation Learning. (arXiv:2202.09817v1 [cs.CL])
    With the success of large-scale pre-trained models (PTMs), how efficiently adapting PTMs to downstream tasks has attracted tremendous attention, especially for PTMs with billions of parameters. Although some parameter-efficient tuning paradigms have been proposed to address this problem, they still require large resources to compute the gradients in the training phase. In this paper, we propose $\mathcal{Y}$-Tuning, an efficient yet effective paradigm to adapt frozen large-scale PTMs to specific downstream tasks. $\mathcal{Y}$-tuning learns dense representations for labels $\mathcal{Y}$ defined in a given task and aligns them to fixed feature representation. Without tuning the features of input text and model parameters, $\mathcal{Y}$-tuning is both parameter-efficient and training-efficient. For $\text{DeBERTa}_\text{XXL}$ with 1.6 billion parameters, $\mathcal{Y}$-tuning achieves performance more than $96\%$ of full fine-tuning on GLUE Benchmark with only $2\%$ tunable parameters and much fewer training costs.  ( 2 min )
    It's Raw! Audio Generation with State-Space Models. (arXiv:2202.09729v1 [cs.SD])
    Developing architectures suitable for modeling raw audio is a challenging problem due to the high sampling rates of audio waveforms. Standard sequence modeling approaches like RNNs and CNNs have previously been tailored to fit the demands of audio, but the resultant architectures make undesirable computational tradeoffs and struggle to model waveforms effectively. We propose SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling. We identify that S4 can be unstable during autoregressive generation, and provide a simple improvement to its parameterization by drawing connections to Hurwitz matrices. SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting. Additionally, SaShiMi improves non-autoregressive generation performance when used as the backbone architecture for a diffusion model. Compared to prior architectures in the autoregressive generation setting, SaShiMi generates piano and speech waveforms which humans find more musical and coherent respectively, e.g. 2x better mean opinion scores than WaveNet on an unconditional speech generation task. On a music generation task, SaShiMi outperforms WaveNet on density estimation and speed at both training and inference even when using 3x fewer parameters. Code can be found at https://github.com/HazyResearch/state-spaces and samples at https://hazyresearch.stanford.edu/sashimi-examples.  ( 2 min )
    Image-to-Graph Transformers for Chemical Structure Recognition. (arXiv:2202.09580v1 [cs.CV])
    For several decades, chemical knowledge has been published in written text, and there have been many attempts to make it accessible, for example, by transforming such natural language text to a structured format. Although the discovered chemical itself commonly represented in an image is the most important part, the correct recognition of the molecular structure from the image in literature still remains a hard problem since they are often abbreviated to reduce the complexity and drawn in many different styles. In this paper, we present a deep learning model to extract molecular structures from images. The proposed model is designed to transform the molecular image directly into the corresponding graph, which makes it capable of handling non-atomic symbols for abbreviations. Also, by end-to-end learning approach it can fully utilize many open image-molecule pair data from various sources, and hence it is more robust to image style variation than other tools. The experimental results show that the proposed model outperforms the existing models with 17.1 % and 12.8 % relative improvement for well-known benchmark datasets and large molecular images that we collected from literature, respectively.  ( 2 min )
    Graph Reparameterizations for Enabling 1000+ Monte Carlo Iterations in Bayesian Deep Neural Networks. (arXiv:2202.09478v1 [cs.LG])
    Uncertainty estimation in deep models is essential in many real-world applications and has benefited from developments over the last several years. Recent evidence suggests that existing solutions dependent on simple Gaussian formulations may not be sufficient. However, moving to other distributions necessitates Monte Carlo (MC) sampling to estimate quantities such as the KL divergence: it could be expensive and scales poorly as the dimensions of both the input data and the model grow. This is directly related to the structure of the computation graph, which can grow linearly as a function of the number of MC samples needed. Here, we construct a framework to describe these computation graphs, and identify probability families where the graph size can be independent or only weakly dependent on the number of MC samples. These families correspond directly to large classes of distributions. Empirically, we can run a much larger number of iterations for MC approximations for larger architectures used in computer vision with gains in performance measured in confident accuracy, stability of training, memory and training time.  ( 2 min )
    SOInter: A Novel Deep Energy Based Interpretation Method for Explaining Structured Output Models. (arXiv:2202.09914v1 [cs.LG])
    We propose a novel interpretation technique to explain the behavior of structured output models, which learn mappings between an input vector to a set of output variables simultaneously. Because of the complex relationship between the computational path of output variables in structured models, a feature can affect the value of output through other ones. We focus on one of the outputs as the target and try to find the most important features utilized by the structured model to decide on the target in each locality of the input space. In this paper, we assume an arbitrary structured output model is available as a black box and argue how considering the correlations between output variables can improve the explanation performance. The goal is to train a function as an interpreter for the target output variable over the input space. We introduce an energy-based training process for the interpreter function, which effectively considers the structural information incorporated into the model to be explained. The effectiveness of the proposed method is confirmed using a variety of simulated and real data sets.  ( 2 min )
    Efficient Continual Learning Ensembles in Neural Network Subspaces. (arXiv:2202.09826v1 [cs.LG])
    A growing body of research in continual learning focuses on the catastrophic forgetting problem. While many attempts have been made to alleviate this problem, the majority of the methods assume a single model in the continual learning setup. In this work, we question this assumption and show that employing ensemble models can be a simple yet effective method to improve continual performance. However, the training and inference cost of ensembles can increase linearly with the number of models. Motivated by this limitation, we leverage the recent advances in the deep learning optimization literature, such as mode connectivity and neural network subspaces, to derive a new method that is both computationally advantageous and can outperform the state-of-the-art continual learning algorithms.  ( 2 min )
    Understanding Robust Generalization in Learning Regular Languages. (arXiv:2202.09717v1 [cs.LG])
    A key feature of human intelligence is the ability to generalize beyond the training distribution, for instance, parsing longer sentences than seen in the past. Currently, deep neural networks struggle to generalize robustly to such shifts in the data distribution. We study robust generalization in the context of using recurrent neural networks (RNNs) to learn regular languages. We hypothesize that standard end-to-end modeling strategies cannot generalize well to systematic distribution shifts and propose a compositional strategy to address this. We compare an end-to-end strategy that maps strings to labels with a compositional strategy that predicts the structure of the deterministic finite-state automaton (DFA) that accepts the regular language. We theoretically prove that the compositional strategy generalizes significantly better than the end-to-end strategy. In our experiments, we implement the compositional strategy via an auxiliary task where the goal is to predict the intermediate states visited by the DFA when parsing a string. Our empirical results support our hypothesis, showing that auxiliary tasks can enable robust generalization. Interestingly, the end-to-end RNN generalizes significantly better than the theoretical lower bound, suggesting that it is able to achieve at least some degree of robust generalization.  ( 2 min )
    Generalized Bayesian Additive Regression Trees Models: Beyond Conditional Conjugacy. (arXiv:2202.09924v1 [stat.ML])
    Bayesian additive regression trees have seen increased interest in recent years due to their ability to combine machine learning techniques with principled uncertainty quantification. The Bayesian backfitting algorithm used to fit BART models, however, limits their application to a small class of models for which conditional conjugacy exists. In this article, we greatly expand the domain of applicability of BART to arbitrary \emph{generalized BART} models by introducing a very simple, tuning-parameter-free, reversible jump Markov chain Monte Carlo algorithm. Our algorithm requires only that the user be able to compute the likelihood and (optionally) its gradient and Fisher information. The potential applications are very broad; we consider examples in survival analysis, structured heteroskedastic regression, and gamma shape regression.  ( 2 min )
    On Optimal Early Stopping: Over-informative versus Under-informative Parametrization. (arXiv:2202.09885v1 [cs.LG])
    Early stopping is a simple and widely used method to prevent over-training neural networks. We develop theoretical results to reveal the relationship between the optimal early stopping time and model dimension as well as sample size of the dataset for certain linear models. Our results demonstrate two very different behaviors when the model dimension exceeds the number of features versus the opposite scenario. While most previous works on linear models focus on the latter setting, we observe that the dimension of the model often exceeds the number of features arising from data in common deep learning tasks and propose a model to study this setting. We demonstrate experimentally that our theoretical results on optimal early stopping time corresponds to the training process of deep neural networks.  ( 2 min )
    A Variance-Reduced Stochastic Accelerated Primal Dual Algorithm. (arXiv:2202.09688v1 [math.OC])
    In this work, we consider strongly convex strongly concave (SCSC) saddle point (SP) problems $\min_{x\in\mathbb{R}^{d_x}}\max_{y\in\mathbb{R}^{d_y}}f(x,y)$ where $f$ is $L$-smooth, $f(.,y)$ is $\mu$-strongly convex for every $y$, and $f(x,.)$ is $\mu$-strongly concave for every $x$. Such problems arise frequently in machine learning in the context of robust empirical risk minimization (ERM), e.g. $\textit{distributionally robust}$ ERM, where partial gradients are estimated using mini-batches of data points. Assuming we have access to an unbiased stochastic first-order oracle we consider the stochastic accelerated primal dual (SAPD) algorithm recently introduced in Zhang et al. [2021] for SCSC SP problems as a robust method against gradient noise. In particular, SAPD recovers the well-known stochastic gradient descent ascent (SGDA) as a special case when the momentum parameter is set to zero and can achieve an accelerated rate when the momentum parameter is properly tuned, i.e., improving the $\kappa \triangleq L/\mu$ dependence from $\kappa^2$ for SGDA to $\kappa$. We propose efficient variance-reduction strategies for SAPD based on Richardson-Romberg extrapolation and show that our method improves upon SAPD both in practice and in theory.  ( 2 min )
    ExAIS: Executable AI Semantics. (arXiv:2202.09868v1 [cs.AI])
    Neural networks can be regarded as a new programming paradigm, i.e., instead of building ever-more complex programs through (often informal) logical reasoning in the programmers' mind, complex 'AI' systems are built by optimising generic neural network models with big data. In this new paradigm, AI frameworks such as TensorFlow and PyTorch play a key role, which is as essential as the compiler for traditional programs. It is known that the lack of a proper semantics for programming languages (such as C), i.e., a correctness specification for compilers, has contributed to many problematic program behaviours and security issues. While it is in general hard to have a correctness specification for compilers due to the high complexity of programming languages and their rapid evolution, we have a unique opportunity to do it right this time for neural networks (which have a limited set of functions, and most of them have stable semantics). In this work, we report our effort on providing a correctness specification of neural network frameworks such as TensorFlow. We specify the semantics of almost all TensorFlow layers in the logical programming language Prolog. We demonstrate the usefulness of the semantics through two applications. One is a fuzzing engine for TensorFlow, which features a strong oracle and a systematic way of generating valid neural networks. The other is a model validation approach which enables consistent bug reporting for TensorFlow models.  ( 2 min )
    Dynamic and Efficient Gray-Box Hyperparameter Optimization for Deep Learning. (arXiv:2202.09774v1 [cs.LG])
    Gray-box hyperparameter optimization techniques have recently emerged as a promising direction for tuning Deep Learning methods. In this work, we introduce DyHPO, a method that learns to dynamically decide which configuration to try next, and for what budget. Our technique is a modification to the classical Bayesian optimization for a gray-box setup. Concretely, we propose a new surrogate for Gaussian Processes that embeds the learning curve dynamics and a new acquisition function that incorporates multi-budget information. We demonstrate the significant superiority of DyHPO against state-of-the-art hyperparameter optimization baselines through large-scale experiments comprising 50 datasets (Tabular, Image, NLP) and diverse neural networks (MLP, CNN/NAS, RNN).  ( 2 min )
    Polytopic Matrix Factorization: Determinant Maximization Based Criterion and Identifiability. (arXiv:2202.09638v1 [stat.ML])
    We introduce Polytopic Matrix Factorization (PMF) as a novel data decomposition approach. In this new framework, we model input data as unknown linear transformations of some latent vectors drawn from a polytope. In this sense, the article considers a semi-structured data model, in which the input matrix is modeled as the product of a full column rank matrix and a matrix containing samples from a polytope as its column vectors. The choice of polytope reflects the presumed features of the latent components and their mutual relationships. As the factorization criterion, we propose the determinant maximization (Det-Max) for the sample autocorrelation matrix of the latent vectors. We introduce a sufficient condition for identifiability, which requires that the convex hull of the latent vectors contains the maximum volume inscribed ellipsoid of the polytope with a particular tightness constraint. Based on the Det-Max criterion and the proposed identifiability condition, we show that all polytopes that satisfy a particular symmetry restriction qualify for the PMF framework. Having infinitely many polytope choices provides a form of flexibility in characterizing latent vectors. In particular, it is possible to define latent vectors with heterogeneous features, enabling the assignment of attributes such as nonnegativity and sparsity at the subvector level. The article offers examples illustrating the connection between polytope choices and the corresponding feature representations.  ( 2 min )
    Generalized Optimistic Methods for Convex-Concave Saddle Point Problems. (arXiv:2202.09674v1 [math.OC])
    The optimistic gradient method has seen increasing popularity as an efficient first-order method for solving convex-concave saddle point problems. To analyze its iteration complexity, a recent work [arXiv:1901.08511] proposed an interesting perspective that interprets the optimistic gradient method as an approximation to the proximal point method. In this paper, we follow this approach and distill the underlying idea of optimism to propose a generalized optimistic method, which encompasses the optimistic gradient method as a special case. Our general framework can handle constrained saddle point problems with composite objective functions and can work with arbitrary norms with compatible Bregman distances. Moreover, we also develop an adaptive line search scheme to select the stepsizes without knowledge of the smoothness coefficients. We instantiate our method with first-order, second-order and higher-order oracles and give sharp global iteration complexity bounds. When the objective function is convex-concave, we show that the averaged iterates of our $p$-th-order method ($p\geq 1$) converge at a rate of $\mathcal{O}(1/N^\frac{p+1}{2})$. When the objective function is further strongly-convex-strongly-concave, we prove a complexity bound of $\mathcal{O}(\frac{L_1}{\mu}\log\frac{1}{\epsilon})$ for our first-order method and a bound of $\mathcal{O}((L_p D^\frac{p-1}{2}/\mu)^{\frac{2}{p+1}}+\log\log\frac{1}{\epsilon})$ for our $p$-th-order method ($p\geq 2$) respectively, where $L_p$ ($p\geq 1$) is the Lipschitz constant of the $p$-th-order derivative, $\mu$ is the strongly-convex parameter, and $D$ is the initial Bregman distance to the saddle point. Moreover, our line search scheme provably only requires an almost constant number of calls to a subproblem solver per iteration on average, making our first-order and second-order methods particularly amenable to implementation.  ( 2 min )
    Deep Learning for Hate Speech Detection: A Comparative Study. (arXiv:2202.09517v1 [cs.CL])
    Automated hate speech detection is an important tool in combating the spread of hate speech, particularly in social media. Numerous methods have been developed for the task, including a recent proliferation of deep-learning based approaches. A variety of datasets have also been developed, exemplifying various manifestations of the hate-speech detection problem. We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods, mediated through the three most commonly used datasets. Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art. We particularly focus our analysis on measures of practical performance, including detection accuracy, computational efficiency, capability in using pre-trained models, and domain generalization. In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions. Code and dataset are available at https://github.com/jmjmalik22/Hate-Speech-Detection.  ( 2 min )
    ChemTab: A Physics Guided Chemistry Modeling Framework. (arXiv:2202.09855v1 [cs.LG])
    Modeling of turbulent combustion system requires modeling the underlying chemistry and the turbulent flow. Solving both systems simultaneously is computationally prohibitive. Instead, given the difference in scales at which the two sub-systems evolve, the two sub-systems are typically (re)solved separately. Popular approaches such as the Flamelet Generated Manifolds (FGM) use a two-step strategy where the governing reaction kinetics are pre-computed and mapped to a low-dimensional manifold, characterized by a few reaction progress variables (model reduction) and the manifold is then "looked-up" during the run-time to estimate the high-dimensional system state by the flow system. While existing works have focused on these two steps independently, we show that joint learning of the progress variables and the look-up model, can yield more accurate results. We propose a deep neural network architecture, called ChemTab, customized for the joint learning task and experimentally demonstrate its superiority over existing state-of-the-art methods.  ( 2 min )
    Shaping Advice in Deep Reinforcement Learning. (arXiv:2202.09489v1 [cs.MA])
    Reinforcement learning involves agents interacting with an environment to complete tasks. When rewards provided by the environment are sparse, agents may not receive immediate feedback on the quality of actions that they take, thereby affecting learning of policies. In this paper, we propose to methods to augment the reward signal from the environment with an additional reward termed shaping advice in both single and multi-agent reinforcement learning. The shaping advice is specified as a difference of potential functions at consecutive time-steps. Each potential function is a function of observations and actions of the agents. The use of potential functions is underpinned by an insight that the total potential when starting from any state and returning to the same state is always equal to zero. We show through theoretical analyses and experimental validation that the shaping advice does not distract agents from completing tasks specified by the environment reward. Theoretically, we prove that the convergence of policy gradients and value functions when using shaping advice implies the convergence of these quantities in the absence of shaping advice. We design two algorithms- Shaping Advice in Single-agent reinforcement learning (SAS) and Shaping Advice in Multi-agent reinforcement learning (SAM). Shaping advice in SAS and SAM needs to be specified only once at the start of training, and can easily be provided by non-experts. Experimentally, we evaluate SAS and SAM on two tasks in single-agent environments and three tasks in multi-agent environments that have sparse rewards. We observe that using shaping advice results in agents learning policies to complete tasks faster, and obtain higher rewards than algorithms that do not use shaping advice.  ( 2 min )
    NetSentry: A Deep Learning Approach to Detecting Incipient Large-scale Network Attacks. (arXiv:2202.09873v1 [cs.CR])
    Machine Learning (ML) techniques are increasingly adopted to tackle ever-evolving high-profile network attacks, including DDoS, botnet, and ransomware, due to their unique ability to extract complex patterns hidden in data streams. These approaches are however routinely validated with data collected in the same environment, and their performance degrades when deployed in different network topologies and/or applied on previously unseen traffic, as we uncover. This suggests malicious/benign behaviors are largely learned superficially and ML-based Network Intrusion Detection System (NIDS) need revisiting, to be effective in practice. In this paper we dive into the mechanics of large-scale network attacks, with a view to understanding how to use ML for Network Intrusion Detection (NID) in a principled way. We reveal that, although cyberattacks vary significantly in terms of payloads, vectors and targets, their early stages, which are critical to successful attack outcomes, share many similarities and exhibit important temporal correlations. Therefore, we treat NID as a time-sensitive task and propose NetSentry, perhaps the first of its kind NIDS that builds on Bidirectional Asymmetric LSTM (Bi-ALSTM), an original ensemble of sequential neural models, to detect network threats before they spread. We cross-evaluate NetSentry using two practical datasets, training on one and testing on the other, and demonstrate F1 score gains above 33% over the state-of-the-art, as well as up to 3 times higher rates of detecting attacks such as XSS and web bruteforce. Further, we put forward a novel data augmentation technique that boosts the generalization abilities of a broad range of supervised deep learning algorithms, leading to average F1 score gains above 35%.  ( 2 min )
    Who Are the Best Adopters? User Selection Model for Free Trial Item Promotion. (arXiv:2202.09508v1 [cs.IR])
    With the increasingly fierce market competition, offering a free trial has become a potent stimuli strategy to promote products and attract users. By providing users with opportunities to experience goods without charge, a free trial makes adopters know more about products and thus encourages their willingness to buy. However, as the critical point in the promotion process, finding the proper adopters is rarely explored. Empirically winnowing users by their static demographic attributes is feasible but less effective, neglecting their personalized preferences. To dynamically match the products with the best adopters, in this work, we propose a novel free trial user selection model named SMILE, which is based on reinforcement learning (RL) where an agent actively selects specific adopters aiming to maximize the profit after free trials. Specifically, we design a tree structure to reformulate the action space, which allows us to select adopters from massive user space efficiently. The experimental analysis on three datasets demonstrates the proposed model's superiority and elucidates why reinforcement learning and tree structure can improve performance. Our study demonstrates technical feasibility for constructing a more robust and intelligent user selection model and guides for investigating more marketing promotion strategies.  ( 2 min )
    Trying to Outrun Causality in Machine Learning: Limitations of Model Explainabilty Techniques for Identifying Predictive Variables. (arXiv:2202.09875v1 [stat.ML])
    Machine Learning explainability techniques have been proposed as a means of `explaining' or interrogating a model in order to understand why a particular decision or prediction has been made. Such an ability is especially important at a time when machine learning is being used to automate decision processes which concern sensitive factors and legal outcomes. Indeed, it is even a requirement according to EU law. Furthermore, researchers concerned with imposing overly restrictive functional form (e.g. as would be the case in a linear regression) may be motivated to use machine learning algorithms in conjunction with explainability techniques, as part of exploratory research, with the goal of identifying important variables which are associated with an outcome of interest. For example, epidemiologists might be interested in identifying 'risk factors' - i.e., factors which affect recovery from disease - by using random forests and assessing variable relevance using importance measures. However, and as we aim to demonstrate, machine learning algorithms are not as flexible as they might seem, and are instead incredibly sensitive to the underling causal structure in the data. The consequences of this are that predictors which are, in fact, critical to a causal system and highly correlated with the outcome, may nonetheless be deemed by explainability techniques to be unrelated/unimportant/unpredictive of the outcome. Rather than this being a limitation of explainability techniques per se, it is rather a consequence of the mathematical implications of regressions, and the interaction of these implications with the associated conditional independencies of the underlying causal structure. We provide some alternative recommendations for researchers wanting to explore the data for important variables.  ( 2 min )
    Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning. (arXiv:2202.09667v1 [cs.LG])
    Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions, which is crucial in applications where experimentation is necessarily limited. OPE/L is nonetheless sensitive to discrepancies between the data-generating environment and that where policies are deployed. Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the proposal relies on inverse-propensity weighting, whose regret rates may deteriorate if propensities are estimated and whose variance is suboptimal even if not. For vanilla OPE/L, this is solved by doubly robust (DR) methods, but they do not naturally extend to the more complex DROPE/L, which involves a worst-case expectation. In this paper, we propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets. For evaluation, we propose Localized Doubly Robust DROPE (LDR$^2$OPE) and prove its semiparametric efficiency under weak product rates conditions. Notably, thanks to a localization technique, LDR$^2$OPE only requires fitting a small number of regressions, just like DR methods for vanilla OPE. For learning, we propose Continuum Doubly Robust DROPL (CDR$^2$OPL) and show that, under a product rate condition involving a continuum of regressions, it enjoys a fast regret rate of $\mathcal{O}(N^{-1/2})$ even when unknown propensities are nonparametrically estimated. We further extend our results to general $f$-divergence uncertainty sets. We illustrate the advantage of our algorithms in simulations.  ( 2 min )
    FedEmbed: Personalized Private Federated Learning. (arXiv:2202.09472v1 [cs.LG])
    Federated learning enables the deployment of machine learning to problems for which centralized data collection is impractical. Adding differential privacy guarantees bounds on privacy while data are contributed to a global model. Adding personalization to federated learning introduces new challenges as we must account for preferences of individual users, where a data sample could have conflicting labels because one sub-population of users might view an input positively, but other sub-populations view the same input negatively. We present FedEmbed, a new approach to private federated learning for personalizing a global model that uses (1) sub-populations of similar users, and (2) personal embeddings. We demonstrate that current approaches to federated learning are inadequate for handling data with conflicting labels, and we show that FedEmbed achieves up to 45% improvement over baseline approaches to personalized private federated learning.  ( 2 min )
    Parallel Sampling for Efficient High-dimensional Bayesian Network Structure Learning. (arXiv:2202.09691v1 [cs.LG])
    Score-based algorithms that learn the structure of Bayesian networks can be used for both exact and approximate solutions. While approximate learning scales better with the number of variables, it can be computationally expensive in the presence of high dimensional data. This paper describes an approximate algorithm that performs parallel sampling on Candidate Parent Sets (CPSs), and can be viewed as an extension of MINOBS which is a state-of-the-art algorithm for structure learning from high dimensional data. The modified algorithm, which we call Parallel Sampling MINOBS (PS-MINOBS), constructs the graph by sampling CPSs for each variable. Sampling is performed in parallel under the assumption the distribution of CPSs is half-normal when ordered by Bayesian score for each variable. Sampling from a half-normal distribution ensures that the CPSs sampled are likely to be those which produce the higher scores. Empirical results show that, in most cases, the proposed algorithm discovers higher score structures than MINOBS when both algorithms are restricted to the same runtime limit.  ( 2 min )
    Sparsity Winning Twice: Better Robust Generaliztion from More Efficient Training. (arXiv:2202.09844v1 [cs.CV])
    Recent studies demonstrate that deep networks, even robustified by the state-of-the-art adversarial training (AT), still suffer from large robust generalization gaps, in addition to the much more expensive training costs than standard training. In this paper, we investigate this intriguing problem from a new perspective, i.e., injecting appropriate forms of sparsity during adversarial training. We introduce two alternatives for sparse adversarial training: (i) static sparsity, by leveraging recent results from the lottery ticket hypothesis to identify critical sparse subnetworks arising from the early training; (ii) dynamic sparsity, by allowing the sparse subnetwork to adaptively adjust its connectivity pattern (while sticking to the same sparsity ratio) throughout training. We find both static and dynamic sparse methods to yield win-win: substantially shrinking the robust generalization gap and alleviating the robust overfitting, meanwhile significantly saving training and inference FLOPs. Extensive experiments validate our proposals with multiple network architectures on diverse datasets, including CIFAR-10/100 and Tiny-ImageNet. For example, our methods reduce robust generalization gap and overfitting by 34.44% and 4.02%, with comparable robust/standard accuracy boosts and 87.83%/87.82% training/inference FLOPs savings on CIFAR-100 with ResNet-18. Besides, our approaches can be organically combined with existing regularizers, establishing new state-of-the-art results in AT. Codes are available in https://github.com/VITA-Group/Sparsity-Win-Robust-Generalization.  ( 2 min )
    Survey of Machine Learning Based Intrusion Detection Methods for Internet of Medical Things. (arXiv:2202.09657v1 [cs.CR])
    Internet of Medical Things (IoMT) represents an application of the Internet of Things, where health professionals perform remote analysis of physiological data collected using sensors that are associated with patients, allowing real-time and permanent monitoring of the patient's health condition and the detection of possible diseases at an early stage. However, the use of wireless communication for data transfer exposes this data to cyberattacks, and the sensitive and private nature of this data may represent a prime interest for attackers. The use of traditional security methods on equipment that is limited in terms of storage and computing capacity is ineffective. In this context, we have performed a comprehensive survey to investigate the use of the intrusion detection system based on machine learning (ML) for IoMT security. We presented the generic three-layer architecture of IoMT, the security requirement of IoMT security. We review the various threats that can affect IoMT security and identify the advantage, disadvantages, methods, and datasets used in each solution based on ML. Then we provide some challenges and limitations of applying ML on each layer of IoMT, which can serve as direction for future study.  ( 2 min )
    Robust Reinforcement Learning as a Stackelberg Game via Adaptively-Regularized Adversarial Training. (arXiv:2202.09514v1 [cs.LG])
    Robust Reinforcement Learning (RL) focuses on improving performances under model errors or adversarial attacks, which facilitates the real-life deployment of RL agents. Robust Adversarial Reinforcement Learning (RARL) is one of the most popular frameworks for robust RL. However, most of the existing literature models RARL as a zero-sum simultaneous game with Nash equilibrium as the solution concept, which could overlook the sequential nature of RL deployments, produce overly conservative agents, and induce training instability. In this paper, we introduce a novel hierarchical formulation of robust RL - a general-sum Stackelberg game model called RRL-Stack - to formalize the sequential nature and provide extra flexibility for robust training. We develop the Stackelberg Policy Gradient algorithm to solve RRL-Stack, leveraging the Stackelberg learning dynamics by considering the adversary's response. Our method generates challenging yet solvable adversarial environments which benefit RL agents' robust learning. Our algorithm demonstrates better training stability and robustness against different testing conditions in the single-agent robotics control and multi-agent highway merging tasks.  ( 2 min )
    Runtime-Assured, Real-Time Neural Control of Microgrids. (arXiv:2202.09710v1 [eess.SY])
    We present SimpleMG, a new, provably correct design methodology for runtime assurance of microgrids (MGs) with neural controllers. Our approach is centered around the Neural Simplex Architecture, which in turn is based on Sha et al.'s Simplex Control Architecture. Reinforcement Learning is used to synthesize high-performance neural controllers for MGs. Barrier Certificates are used to establish SimpleMG's runtime-assurance guarantees. We present a novel method to derive the condition for switching from the unverified neural controller to the verified-safe baseline controller, and we prove that the method is correct. We conduct an extensive experimental evaluation of SimpleMG using RTDS, a high-fidelity, real-time simulation environment for power systems, on a realistic model of a microgrid comprising three distributed energy resources (battery, photovoltaic, and diesel generator). Our experiments confirm that SimpleMG can be used to develop high-performance neural controllers for complex microgrids while assuring runtime safety, even in the presence of adversarial input attacks on the neural controller. Our experiments also demonstrate the benefits of online retraining of the neural controller while the baseline controller is in control  ( 2 min )
    Symplectic Neural Networks in Taylor Series Form for Hamiltonian Systems. (arXiv:2005.04986v4 [cs.LG] UPDATED)
    We propose an effective and lightweight learning algorithm, Symplectic Taylor Neural Networks (Taylor-nets), to conduct continuous, long-term predictions of a complex Hamiltonian dynamic system based on sparse, short-term observations. At the heart of our algorithm is a novel neural network architecture consisting of two sub-networks. Both are embedded with terms in the form of Taylor series expansion designed with symmetric structure. The key mechanism underpinning our infrastructure is the strong expressiveness and special symmetric property of the Taylor series expansion, which naturally accommodate the numerical fitting process of the gradients of the Hamiltonian with respect to the generalized coordinates as well as preserve its symplectic structure. We further incorporate a fourth-order symplectic integrator in conjunction with neural ODEs' framework into our Taylor-net architecture to learn the continuous-time evolution of the target systems while simultaneously preserving their symplectic structures. We demonstrated the efficacy of our Taylor-net in predicting a broad spectrum of Hamiltonian dynamic systems, including the pendulum, the Lotka--Volterra, the Kepler, and the H\'enon--Heiles systems. Our model exhibits unique computational merits by outperforming previous methods to a great extent regarding the prediction accuracy, the convergence rate, and the robustness despite using extremely small training data with a short training period (6000 times shorter than the predicting period), small sample sizes, and no intermediate data to train the networks.  ( 2 min )
    Interactive Visual Pattern Search on Graph Data via Graph Representation Learning. (arXiv:2202.09459v1 [cs.LG])
    Graphs are a ubiquitous data structure to model processes and relations in a wide range of domains. Examples include control-flow graphs in programs and semantic scene graphs in images. Identifying subgraph patterns in graphs is an important approach to understanding their structural properties. We propose a visual analytics system GraphQ to support human-in-the-loop, example-based, subgraph pattern search in a database containing many individual graphs. To support fast, interactive queries, we use graph neural networks (GNNs) to encode a graph as fixed-length latent vector representation, and perform subgraph matching in the latent space. Due to the complexity of the problem, it is still difficult to obtain accurate one-to-one node correspondences in the matching results that are crucial for visualization and interpretation. We, therefore, propose a novel GNN for node-alignment called NeuroAlign, to facilitate easy validation and interpretation of the query results. GraphQ provides a visual query interface with a query editor and a multi-scale visualization of the results, as well as a user feedback mechanism for refining the results with additional constraints. We demonstrate GraphQ through two example usage scenarios: analyzing reusable subroutines in program workflows and semantic scene graph search in images. Quantitative experiments show that NeuroAlign achieves 19-29% improvement in node-alignment accuracy compared to baseline GNN and provides up to 100x speedup compared to combinatorial algorithms. Our qualitative study with domain experts confirms the effectiveness for both usage scenarios.  ( 2 min )
    Bayes-Optimal Classifiers under Group Fairness. (arXiv:2202.09724v1 [stat.ML])
    Machine learning algorithms are becoming integrated into more and more high-stakes decision-making processes, such as in social welfare issues. Due to the need of mitigating the potentially disparate impacts from algorithmic predictions, many approaches have been proposed in the emerging area of fair machine learning. However, the fundamental problem of characterizing Bayes-optimal classifiers under various group fairness constraints is not well understood as a theoretical benchmark. Based on the classical Neyman-Pearson argument (Neyman and Pearson, 1933; Shao, 2003) for optimal hypothesis testing, this paper provides a general framework for deriving Bayes-optimal classifiers under group fairness. This enables us to propose a group-based thresholding method that can directly control disparity, and more importantly, achieve an optimal fairness-accuracy tradeoff. These advantages are supported by experiments.  ( 2 min )
    A History of Meta-gradient: Gradient Methods for Meta-learning. (arXiv:2202.09701v1 [cs.LG])
    The history of meta-learning methods based on gradient descent is reviewed, focusing primarily on methods that adapt step-size (learning rate) meta-parameters.  ( 1 min )
    A Novel Framework for Brain Tumor Detection Based on Convolutional Variational Generative Models. (arXiv:2202.09850v1 [eess.IV])
    Brain tumor detection can make the difference between life and death. Recently, deep learning-based brain tumor detection techniques have gained attention due to their higher performance. However, obtaining the expected performance of such deep learning-based systems requires large amounts of classified images to train the deep models. Obtaining such data is usually boring, time-consuming, and can easily be exposed to human mistakes which hinder the utilization of such deep learning approaches. This paper introduces a novel framework for brain tumor detection and classification. The basic idea is to generate a large synthetic MRI images dataset that reflects the typical pattern of the brain MRI images from a small class-unbalanced collected dataset. The resulted dataset is then used for training a deep model for detection and classification. Specifically, we employ two types of deep models. The first model is a generative model to capture the distribution of the important features in a set of small class-unbalanced brain MRI images. Then by using this distribution, the generative model can synthesize any number of brain MRI images for each class. Hence, the system can automatically convert a small unbalanced dataset to a larger balanced one. The second model is the classifier that is trained using the large balanced dataset to detect brain tumors in MRI images. The proposed framework acquires an overall detection accuracy of 96.88% which highlights the promise of the proposed framework as an accurate low-overhead brain tumor detection system.  ( 2 min )
    Communication-Efficient Actor-Critic Methods for Homogeneous Markov Games. (arXiv:2202.09422v1 [cs.MA])
    Recent success in cooperative multi-agent reinforcement learning (MARL) relies on centralized training and policy sharing. Centralized training eliminates the issue of non-stationarity MARL yet induces large communication costs, and policy sharing is empirically crucial to efficient learning in certain tasks yet lacks theoretical justification. In this paper, we formally characterize a subclass of cooperative Markov games where agents exhibit a certain form of homogeneity such that policy sharing provably incurs no suboptimality. This enables us to develop the first consensus-based decentralized actor-critic method where the consensus update is applied to both the actors and the critics while ensuring convergence. We also develop practical algorithms based on our decentralized actor-critic method to reduce the communication cost during training, while still yielding policies comparable with centralized training.  ( 2 min )
    Black-box Node Injection Attack for Graph Neural Networks. (arXiv:2202.09389v1 [cs.LG])
    Graph Neural Networks (GNNs) have drawn significant attentions over the years and been broadly applied to vital fields that require high security standard such as product recommendation and traffic forecasting. Under such scenarios, exploiting GNN's vulnerabilities and further downgrade its classification performance become highly incentive for adversaries. Previous attackers mainly focus on structural perturbations of existing graphs. Although they deliver promising results, the actual implementation needs capability of manipulating the graph connectivity, which is impractical in some circumstances. In this work, we study the possibility of injecting nodes to evade the victim GNN model, and unlike previous related works with white-box setting, we significantly restrict the amount of accessible knowledge and explore the black-box setting. Specifically, we model the node injection attack as a Markov decision process and propose GA2C, a graph reinforcement learning framework in the fashion of advantage actor critic, to generate realistic features for injected nodes and seamlessly merge them into the original graph following the same topology characteristics. Through our extensive experiments on multiple acknowledged benchmark datasets, we demonstrate the superior performance of our proposed GA2C over existing state-of-the-art methods. The data and source code are publicly accessible at: https://github.com/jumxglhf/GA2C.  ( 2 min )
    Benchmarking the Linear Algebra Awareness of TensorFlow and PyTorch. (arXiv:2202.09888v1 [cs.MS])
    Linear algebra operations, which are ubiquitous in machine learning, form major performance bottlenecks. The High-Performance Computing community invests significant effort in the development of architecture-specific optimized kernels, such as those provided by the BLAS and LAPACK libraries, to speed up linear algebra operations. However, end users are progressively less likely to go through the error prone and time-consuming process of directly using said kernels; instead, frameworks such as TensorFlow (TF) and PyTorch (PyT), which facilitate the development of machine learning applications, are becoming more and more popular. Although such frameworks link to BLAS and LAPACK, it is not clear whether or not they make use of linear algebra knowledge to speed up computations. For this reason, in this paper we develop benchmarks to investigate the linear algebra optimization capabilities of TF and PyT. Our analyses reveal that a number of linear algebra optimizations are still missing; for instance, reducing the number of scalar operations by applying the distributive law, and automatically identifying the optimal parenthesization of a matrix chain. In this work, we focus on linear algebra computations in TF and PyT; we both expose opportunities for performance enhancement to the benefit of the developers of the frameworks and provide end users with guidelines on how to achieve performance gains.  ( 2 min )
    Learning logic programs by discovering where not to search. (arXiv:2202.09806v1 [cs.LG])
    The goal of inductive logic programming (ILP) is to search for a hypothesis that generalises training examples and background knowledge (BK). To improve performance, we introduce an approach that, before searching for a hypothesis, first discovers `where not to search'. We use given BK to discover constraints on hypotheses, such as that a number cannot be both even and odd. We use the constraints to bootstrap a constraint-driven ILP system. Our experiments on multiple domains (including program synthesis and inductive general game playing) show that our approach can substantially reduce learning times.  ( 2 min )
    The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication. (arXiv:2202.09653v1 [cs.LG])
    We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$ where $\Delta$ is the gap between the $m$-th and $m+1$-st best arms. Such guarantees were recently achieved in a model allowing the players to implicitly communicate through intentional collisions. We show that with no communication at all, such guarantees are, surprisingly, not achievable. In fact, obtaining the optimal $\tilde{O}(1/\Delta)$ regret for some regimes of $\Delta$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author and enjoys the same strong no-collision property, while our lower bound is based on a topological obstruction and holds even under full information.  ( 2 min )
    Learning a Shield from Catastrophic Action Effects: Never Repeat the Same Mistake. (arXiv:2202.09516v1 [cs.LG])
    Agents that operate in an unknown environment are bound to make mistakes while learning, including, at least occasionally, some that lead to catastrophic consequences. When humans make catastrophic mistakes, they are expected to learn never to repeat them, such as a toddler who touches a hot stove and immediately learns never to do so again. In this work we consider a novel class of POMDPs, called POMDP with Catastrophic Actions (POMDP-CA) in which pairs of states and actions are labeled as catastrophic. Agents that act in a POMDP-CA do not have a priori knowledge about which (state, action) pairs are catastrophic, thus they are sure to make mistakes when trying to learn any meaningful policy. Rather, their aim is to maximize reward while never repeating mistakes. As a first step of avoiding mistake repetition, we leverage the concept of a shield which prevents agents from executing specific actions from specific states. In particular, we store catastrophic mistakes (unsafe pairs of states and actions) that agents make in a database. Agents are then forbidden to pick actions that appear in the database. This approach is especially useful in a continual learning setting, where groups of agents perform a variety of tasks over time in the same underlying environment. In this setting, a task-agnostic shield can be constructed in a way that stores mistakes made by any agent, such that once one agent in a group makes a mistake the entire group learns to never repeat that mistake. This paper introduces a variant of the PPO algorithm that utilizes this shield, called ShieldPPO, and empirically evaluates it in a controlled environment. Results indicate that ShieldPPO outperforms PPO, as well as baseline methods from the safe reinforcement learning literature, in a range of settings.  ( 2 min )
    An Analysis of Complex-Valued CNNs for RF Data-Driven Wireless Device Classification. (arXiv:2202.09777v1 [cs.LG])
    Recent deep neural network-based device classification studies show that complex-valued neural networks (CVNNs) yield higher classification accuracy than real-valued neural networks (RVNNs). Although this improvement is (intuitively) attributed to the complex nature of the input RF data (i.e., IQ symbols), no prior work has taken a closer look into analyzing such a trend in the context of wireless device identification. Our study provides a deeper understanding of this trend using real LoRa and WiFi RF datasets. We perform a deep dive into understanding the impact of (i) the input representation/type and (ii) the architectural layer of the neural network. For the input representation, we considered the IQ as well as the polar coordinates both partially and fully. For the architectural layer, we considered a series of ablation experiments that eliminate parts of the CVNN components. Our results show that CVNNs consistently outperform RVNNs counterpart in the various scenarios mentioned above, indicating that CVNNs are able to make better use of the joint information provided via the in-phase (I) and quadrature (Q) components of the signal.  ( 2 min )
    Explaining, Evaluating and Enhancing Neural Networks' Learned Representations. (arXiv:2202.09374v1 [cs.LG])
    Most efforts in interpretability in deep learning have focused on (1) extracting explanations of a specific downstream task in relation to the input features and (2) imposing constraints on the model, often at the expense of predictive performance. New advances in (unsupervised) representation learning and transfer learning, however, raise the need for an explanatory framework for networks that are trained without a specific downstream task. We address these challenges by showing how explainability can be an aid, rather than an obstacle, towards better and more efficient representations. Specifically, we propose a natural aggregation method generalizing attribution maps between any two (convolutional) layers of a neural network. Additionally, we employ such attributions to define two novel scores for evaluating the informativeness and the disentanglement of latent embeddings. Extensive experiments show that the proposed scores do correlate with the desired properties. We also confirm and extend previously known results concerning the independence of some common saliency strategies from the model parameters. Finally, we show that adopting our proposed scores as constraints during the training of a representation learning task improves the downstream performance of the model.  ( 2 min )
    Interacting Contour Stochastic Gradient Langevin Dynamics. (arXiv:2202.09867v1 [stat.ML])
    We propose an interacting contour stochastic gradient Langevin dynamics (ICSGLD) sampler, an embarrassingly parallel multiple-chain contour stochastic gradient Langevin dynamics (CSGLD) sampler with efficient interactions. We show that ICSGLD can be theoretically more efficient than a single-chain CSGLD with an equivalent computational budget. We also present a novel random-field function, which facilitates the estimation of self-adapting parameters in big data and obtains free mode explorations. Empirically, we compare the proposed algorithm with popular benchmark methods for posterior sampling. The numerical results show a great potential of ICSGLD for large-scale uncertainty estimation tasks.  ( 2 min )
    TransDreamer: Reinforcement Learning with Transformer World Models. (arXiv:2202.09481v1 [cs.LG])
    The Dreamer agent provides various benefits of Model-Based Reinforcement Learning (MBRL) such as sample efficiency, reusable knowledge, and safe planning. However, its world model and policy networks inherit the limitations of recurrent neural networks and thus an important question is how an MBRL framework can benefit from the recent advances of transformers and what the challenges are in doing so. In this paper, we propose a transformer-based MBRL agent, called TransDreamer. We first introduce the Transformer State-Space Model, a world model that leverages a transformer for dynamics predictions. We then share this world model with a transformer-based policy network and obtain stability in training a transformer-based RL agent. In experiments, we apply the proposed model to 2D visual RL and 3D first-person visual RL tasks both requiring long-range memory access for memory-based reasoning. We show that the proposed model outperforms Dreamer in these complex tasks.  ( 2 min )
    Truncated Diffusion Probabilistic Models. (arXiv:2202.09671v1 [stat.ML])
    Employing a forward Markov diffusion chain to gradually map the data to a noise distribution, diffusion probabilistic models learn how to generate the data by inferring a reverse Markov diffusion chain to invert the forward diffusion process. To achieve competitive data generation performance, they demand a long diffusion chain that makes them computationally intensive in not only training but also generation. To significantly improve the computation efficiency, we propose to truncate the forward diffusion chain by abolishing the requirement of diffusing the data to random noise. Consequently, we start the inverse diffusion chain from an implicit generative distribution, rather than random noise, and learn its parameters by matching it to the distribution of the data corrupted by the truncated forward diffusion chain. Experimental results show our truncated diffusion probabilistic models provide consistent improvements over the non-truncated ones in terms of the generation performance and the number of required inverse diffusion steps.  ( 2 min )
    Learning to Control Partially Observed Systems with Finite Memory. (arXiv:2202.09753v1 [cs.LG])
    We consider the reinforcement learning problem for partially observed Markov decision processes (POMDPs) with large or even countably infinite state spaces, where the controller has access to only noisy observations of the underlying controlled Markov chain. We consider a natural actor-critic method that employs a finite internal memory for policy parameterization, and a multi-step temporal difference learning algorithm for policy evaluation. We establish, to the best of our knowledge, the first non-asymptotic global convergence of actor-critic methods for partially observed systems under function approximation. In particular, in addition to the function approximation and statistical errors that also arise in MDPs, we explicitly characterize the error due to the use of finite-state controllers. This additional error is stated in terms of the total variation distance between the traditional belief state in POMDPs and the posterior distribution of the hidden state when using a finite-state controller. Further, we show that this error can be made small in the case of sliding-block controllers by using larger block sizes.  ( 2 min )
    Mixed Effects Neural ODE: A Variational Approximation for Analyzing the Dynamics of Panel Data. (arXiv:2202.09463v1 [cs.LG])
    Panel data involving longitudinal measurements of the same set of participants taken over multiple time points is common in studies to understand childhood development and disease modeling. Deep hybrid models that marry the predictive power of neural networks with physical simulators such as differential equations, are starting to drive advances in such applications. The task of modeling not just the observations but the hidden dynamics that are captured by the measurements poses interesting statistical/computational questions. We propose a probabilistic model called ME-NODE to incorporate (fixed + random) mixed effects for analyzing such panel data. We show that our model can be derived using smooth approximations of SDEs provided by the Wong-Zakai theorem. We then derive Evidence Based Lower Bounds for ME-NODE, and develop (efficient) training algorithms using MC based sampling methods and numerical ODE solvers. We demonstrate ME-NODE's utility on tasks spanning the spectrum from simulations and toy data to real longitudinal 3D imaging data from an Alzheimer's disease (AD) study, and study its performance in terms of accuracy of reconstruction for interpolation, uncertainty estimates and personalized prediction.  ( 2 min )
    From Quantum Graph Computing to Quantum Graph Learning: A Survey. (arXiv:2202.09506v1 [quant-ph])
    Quantum computing (QC) is a new computational paradigm whose foundations relate to quantum physics. Notable progress has been made, driving the birth of a series of quantum-based algorithms that take advantage of quantum computational power. In this paper, we provide a targeted survey of the development of QC for graph-related tasks. We first elaborate the correlations between quantum mechanics and graph theory to show that quantum computers are able to generate useful solutions that can not be produced by classical systems efficiently for some problems related to graphs. For its practicability and wide-applicability, we give a brief review of typical graph learning techniques designed for various tasks. Inspired by these powerful methods, we note that advanced quantum algorithms have been proposed for characterizing the graph structures. We give a snapshot of quantum graph learning where expectations serve as a catalyst for subsequent research. We further discuss the challenges of using quantum algorithms in graph learning, and future directions towards more flexible and versatile quantum graph learning solvers.  ( 2 min )
    Mixture-of-Experts with Expert Choice Routing. (arXiv:2202.09368v1 [cs.LG])
    Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2x. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.  ( 2 min )
    Numeric Encoding Options with Automunge. (arXiv:2202.09496v1 [cs.LG])
    Mainstream practice in machine learning with tabular data may take for granted that any feature engineering beyond scaling for numeric sets is superfluous in context of deep neural networks. This paper will offer arguments for potential benefits of extended encodings of numeric streams in deep learning by way of a survey of options for numeric transformations as available in the Automunge open source python library platform for tabular data pipelines, where transformations may be applied to distinct columns in "family tree" sets with generations and branches of derivations. Automunge transformation options include normalization, binning, noise injection, derivatives, and more. The aggregation of these methods into family tree sets of transformations are demonstrated for use to present numeric features to machine learning in multiple configurations of varying information content, as may be applied to encode numeric sets of unknown interpretation. Experiments demonstrate the realization of a novel generalized solution to data augmentation by noise injection for tabular learning, as may materially benefit model performance in applications with underserved training data.  ( 2 min )
    A Regularized Implicit Policy for Offline Reinforcement Learning. (arXiv:2202.09673v1 [stat.ML])
    Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment. The lack of environmental interactions makes the policy training vulnerable to state-action pairs far from the training dataset and prone to missing rewarding actions. For training more effective agents, we propose a framework that supports learning a flexible yet well-regularized fully-implicit policy. We further propose a simple modification to the classical policy-matching methods for regularizing with respect to the dual form of the Jensen--Shannon divergence and the integral probability metrics. We theoretically show the correctness of the policy-matching approach, and the correctness and a good finite-sample property of our modification. An effective instantiation of our framework through the GAN structure is provided, together with techniques to explicitly smooth the state-action mapping for robust generalization beyond the static dataset. Extensive experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.  ( 2 min )
    Parsed Categoric Encodings with Automunge. (arXiv:2202.09498v1 [cs.LG])
    The Automunge open source python library platform for tabular data pre-processing automates feature engineering data transformations of numerical encoding and missing data infill to received tidy data on bases fit to properties of columns in a designated train set for consistent and efficient application to subsequent data pipelines such as for inference, where transformations may be applied to distinct columns in "family tree" sets with generations and branches of derivations. Included in the library of transformations are methods to extract structure from bounded categorical string sets by way of automated string parsing, in which comparisons between entries in the set of unique values are parsed to identify character subset overlaps which may be encoded by appended columns of boolean overlap detection activations or by replacing string entries with identified overlap partitions. Further string parsing options, which may also be applied to unbounded categoric sets, include extraction of numeric substring partitions from entries or search functions to identify presence of specified substring partitions. The aggregation of these methods into "family tree" sets of transformations are demonstrated for use to automatically extract structure from categoric string compositions in relation to the set of entries in a column, such as may be applied to prepare categoric string set encodings for machine learning without human intervention.  ( 2 min )
  • Open

    Imbalanced Malware Images Classification: a CNN based Approach. (arXiv:1708.08042v2 [cs.CV] UPDATED)
    Deep convolutional neural networks (CNNs) can be applied to malware binary detection via image classification. The performance, however, is degraded due to the imbalance of malware families (classes). To mitigate this issue, we propose a simple yet effective weighted softmax loss which can be employed as the final layer of deep CNNs. The original softmax loss is weighted, and the weight value can be determined according to class size. A scaling parameter is also included in computing the weight. Proper selection of this parameter is studied and an empirical option is suggested. The weighted loss aims at alleviating the impact of data imbalance in an end-to-end learning fashion. To validate the efficacy, we deploy the proposed weighted loss in a pre-trained deep CNN model and fine-tune it to achieve promising results on malware images classification. Extensive experiments also demonstrate that the new loss function can well fit other typical CNNs, yielding an improved classification performance.  ( 2 min )
    Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty. (arXiv:2110.08607v2 [cs.LG] UPDATED)
    In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework is especially targeted to the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where it is typically intractable to perform exact inference of latent variables. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space is often prone to lack physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modelling framework for unsupervised learning and identification for nonlinear dynamical systems. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential to generalization and prediction capabilities.  ( 2 min )
    Node Feature Kernels Increase Graph Convolutional Network Robustness. (arXiv:2109.01785v3 [cs.LG] UPDATED)
    The robustness of the much-used Graph Convolutional Networks (GCNs) to perturbations of their input is becoming a topic of increasing importance. In this paper, the random GCN is introduced for which a random matrix theory analysis is possible. This analysis suggests that if the graph is sufficiently perturbed, or in the extreme case random, then the GCN fails to benefit from the node features. It is furthermore observed that enhancing the message passing step in GCNs by adding the node feature kernel to the adjacency matrix of the graph structure solves this problem. An empirical study of a GCN utilised for node classification on six real datasets further confirms the theoretical findings and demonstrates that perturbations of the graph structure can result in GCNs performing significantly worse than Multi-Layer Perceptrons run on the node features alone. In practice, adding a node feature kernel to the message passing of perturbed graphs results in a significant improvement of the GCN's performance, thereby rendering it more robust to graph perturbations. Our code is publicly available at:https://github.com/ChangminWu/RobustGCN.  ( 2 min )
    Sqrt(d) Dimension Dependence of Langevin Monte Carlo. (arXiv:2109.03839v3 [cs.LG] UPDATED)
    This article considers the popular MCMC method of unadjusted Langevin Monte Carlo (LMC) and provides a non-asymptotic analysis of its sampling error in 2-Wasserstein distance. The proof is based on a refinement of mean-square analysis in Li et al. (2019), and this refined framework automates the analysis of a large class of sampling algorithms based on discretizations of contractive SDEs. Using this framework, we establish an $\tilde{O}(\sqrt{d}/\epsilon)$ mixing time bound for LMC, without warm start, under the common log-smooth and log-strongly-convex conditions, plus a growth condition on the 3rd-order derivative of the potential of target measures. This bound improves the best previously known $\tilde{O}(d/\epsilon)$ result and is optimal (in terms of order) in both dimension $d$ and accuracy tolerance $\epsilon$ for target measures satisfying the aforementioned assumptions. Our theoretical analysis is further validated by numerical experiments.  ( 2 min )
    On the Complexity of Approximating Multimarginal Optimal Transport. (arXiv:1910.00152v3 [stat.ML] UPDATED)
    We study the complexity of approximating the multimarginal optimal transport (MOT) distance, a generalization of the classical optimal transport distance, considered here between $m$ discrete probability distributions supported each on $n$ support points. First, we show that the standard linear programming (LP) representation of the MOT problem is not a minimum-cost flow problem when $m \geq 3$. This negative result implies that some combinatorial algorithms, e.g., network simplex method, are not suitable for approximating the MOT problem, while the worst-case complexity bound for the deterministic interior-point algorithm remains a quantity of $\tilde{O}(n^{3m})$. We then propose two simple and \textit{deterministic} algorithms for approximating the MOT problem. The first algorithm, which we refer to as \textit{multimarginal Sinkhorn} algorithm, is a provably efficient multimarginal generalization of the Sinkhorn algorithm. We show that it achieves a complexity bound of $\tilde{O}(m^3n^m\varepsilon^{-2})$ for a tolerance $\varepsilon \in (0, 1)$. This provides a first \textit{near-linear time} complexity bound guarantee for approximating the MOT problem and matches the best known complexity bound for the Sinkhorn algorithm in the classical OT setting when $m = 2$. The second algorithm, which we refer to as \textit{accelerated multimarginal Sinkhorn} algorithm, achieves the acceleration by incorporating an estimate sequence and the complexity bound is $\tilde{O}(m^3n^{m+1/3}\varepsilon^{-4/3})$. This bound is better than that of the first algorithm in terms of $1/\varepsilon$, and accelerated alternating minimization algorithm~\citep{Tupitsa-2020-Multimarginal} in terms of $n$. Finally, we compare our new algorithms with the commercial LP solver \textsc{Gurobi}. Preliminary results on synthetic data and real images demonstrate the effectiveness and efficiency of our algorithms.  ( 2 min )
    Low-Dimensional High-Fidelity Kinetic Models for NOX Formation by a Compute Intensification Method. (arXiv:2202.10194v1 [physics.chem-ph])
    A novel compute intensification methodology to the construction of low-dimensional, high-fidelity "compact" kinetic models for NOX formation is designed and demonstrated. The method adapts the data intensive Machine Learned Optimization of Chemical Kinetics (MLOCK) algorithm for compact model generation by the use of a Latin Square method for virtual reaction network generation. A set of logical rules are defined which construct a minimally sized virtual reaction network comprising three additional nodes (N, NO, NO2). This NOX virtual reaction network is appended to a pre-existing compact model for methane combustion comprising fifteen nodes. The resulting eighteen node virtual reaction network is processed by the MLOCK coded algorithm to produce a plethora of compact model candidates for NOX formation during methane combustion. MLOCK automatically; populates the terms of the virtual reaction network with candidate inputs; measures the success of the resulting compact model candidates (in reproducing a broad set of gas turbine industry-defined performance targets); selects regions of input parameters space showing models of best performance; refines the input parameters to give better performance; and makes an ultimate selection of the best performing model or models. By this method, it is shown that a number of compact model candidates exist that show fidelities in excess of 75% in reproducing industry defined performance targets, with one model valid to >75% across fuel/air equivalence ratios of 0.5-1.0. However, to meet the full fuel/air equivalence ratio performance envelope defined by industry, we show that with this minimal virtual reaction network, two further compact models are required.  ( 2 min )
    Double-descent curves in neural networks: a new perspective using Gaussian processes. (arXiv:2102.07238v4 [stat.ML] UPDATED)
    Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, then grows after reaching an optimal number of parameters which is less than the number of data points, but then descends again in the overparameterized regime. Here we use a neural network Gaussian process (NNGP) which maps exactly to a fully connected network (FCN) in the infinite-width limit, combined with techniques from random matrix theory, to calculate this generalisation behaviour. An advantage of our NNGP approach is that the analytical calculations are easier to interpret. We argue that the fact that the generalisation error of neural networks decreases in the overparameterized regime and has a finite theoretical value is explained by the convergence to their limiting Gaussian processes. Our analysis thus provides a mathematical explanation for a surprising phenomenon that could not explained by conventional statistical learning theory. However, understanding what drives these finite theoretical values to be the state-of-the-art generalisation performances in many applications remains an open question, for which we only provide new leads in this paper.  ( 2 min )
    Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization. (arXiv:2107.12580v2 [cs.LG] UPDATED)
    Central to the success of artificial neural networks is their ability to generalize. But does neural network generalization primarily rely on seeing highly similar training examples (memorization)? Or are neural networks capable of human-intelligence styled reasoning, and if so, to what extent? These remain fundamental open questions on artificial neural networks. In this paper, as steps towards answering these questions, we introduce a new benchmark, Pointer Value Retrieval (PVR) to study the limits of neural network reasoning. The PVR suite of tasks is based on reasoning about indirection, a hallmark of human intelligence, where a first stage (task) contains instructions for solving a second stage (task). In PVR, this is done by having one part of the task input act as a pointer, giving instructions on a different input location, which forms the output. We show this simple rule can be applied to create a diverse set of tasks across different input modalities and configurations. Importantly, this use of indirection enables systematically varying task difficulty through distribution shifts and increasing functional complexity. We conduct a detailed empirical study of different PVR tasks, discovering large variations in performance across dataset sizes, neural network architectures and task complexity. Further, by incorporating distribution shift and increased functional complexity, we develop nuanced tests for reasoning, revealing subtle failures and surprising successes, suggesting many promising directions of exploration on this benchmark.  ( 2 min )
    Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization. (arXiv:2103.01499v2 [cs.LG] UPDATED)
    Batch Normalization (BN) is a commonly used technique to accelerate and stabilize training of deep neural networks. Despite its empirical success, a full theoretical understanding of BN is yet to be developed. In this work, we analyze BN through the lens of convex optimization. We introduce an analytic framework based on convex duality to obtain exact convex representations of weight-decay regularized ReLU networks with BN, which can be trained in polynomial-time. Our analyses also show that optimal layer weights can be obtained as simple closed-form formulas in the high-dimensional and/or overparameterized regimes. Furthermore, we find that Gradient Descent provides an algorithmic bias effect on the standard non-convex BN network, and we design an approach to explicitly encode this implicit regularization into the convex objective. Experiments with CIFAR image classification highlight the effectiveness of this explicit regularization for mimicking and substantially improving the performance of standard BN networks.  ( 2 min )
    On the Necessity of Auditable Algorithmic Definitions for Machine Unlearning. (arXiv:2110.11891v2 [cs.LG] UPDATED)
    Machine unlearning, i.e. having a model forget about some of its training data, has become increasingly more important as privacy legislation promotes variants of the right-to-be-forgotten. In the context of deep learning, approaches for machine unlearning are broadly categorized into two classes: exact unlearning methods, where an entity has formally removed the data point's impact on the model by retraining the model from scratch, and approximate unlearning, where an entity approximates the model parameters one would obtain by exact unlearning to save on compute costs. In this paper, we first show that the definition that underlies approximate unlearning, which seeks to prove the approximately unlearned model is close to an exactly retrained model, is incorrect because one can obtain the same model using different datasets. Thus one could unlearn without modifying the model at all. We then turn to exact unlearning approaches and ask how to verify their claims of unlearning. Our results show that even for a given training trajectory one cannot formally prove the absence of certain data points used during training. We thus conclude that unlearning is only well-defined at the algorithmic level, where an entity's only possible auditable claim to unlearning is that they used a particular algorithm designed to allow for external scrutiny during an audit.  ( 2 min )
    A bandit-learning approach to multifidelity approximation. (arXiv:2103.15342v3 [math.NA] UPDATED)
    Multifidelity approximation is an important technique in scientific computation and simulation. In this paper, we introduce a bandit-learning approach for leveraging data of varying fidelities to achieve precise estimates of the parameters of interest. Under a linear model assumption, we formulate a multifidelity approximation as a modified stochastic bandit, and analyze the loss for a class of policies that uniformly explore each model before exploiting. Utilizing the estimated conditional mean-squared error, we propose a consistent algorithm, adaptive Explore-Then-Commit (AETC), and establish a corresponding trajectory-wise optimality result. These results are then extended to the case of vector-valued responses, where we demonstrate that the algorithm is efficient without the need to worry about estimating high-dimensional parameters. The main advantage of our approach is that we require neither hierarchical model structure nor \textit{a priori} knowledge of statistical information (e.g., correlations) about or between models. Instead, the AETC algorithm requires only knowledge of which model is a trusted high-fidelity model, along with (relative) computational cost estimates of querying each model. Numerical experiments are provided at the end to support our theoretical findings.  ( 2 min )
    Conformer-based Hybrid ASR System for Switchboard Dataset. (arXiv:2111.03442v2 [cs.CL] UPDATED)
    The recently proposed conformer architecture has been successfully used for end-to-end automatic speech recognition (ASR) architectures achieving state-of-the-art performance on different datasets. To our best knowledge, the impact of using conformer acoustic model for hybrid ASR is not investigated. In this paper, we present and evaluate a competitive conformer-based hybrid model training recipe. We study different training aspects and methods to improve word-error-rate as well as to increase training speed. We apply time downsampling methods for efficient training and use transposed convolutions to upsample the output sequence again. We conduct experiments on Switchboard 300h dataset and our conformer-based hybrid model achieves competitive results compared to other architectures. It generalizes very well on Hub5'01 test set and outperforms the BLSTM-based hybrid model significantly.  ( 2 min )
    Deep kernel machines: exact inference with representation learning in infinite Bayesian neural networks. (arXiv:2108.13097v3 [stat.ML] UPDATED)
    Deep neural networks (DNNs) with the flexibility to learn good top-layer representations have eclipsed shallow kernel methods without that flexibility. Here, we take inspiration from DNNs to develop the deep kernel machine. Optimizing the deep kernel machine objective is equivalent to exact Bayesian inference (or noisy gradient descent) in an infinitely wide Bayesian neural network or deep Gaussian process, which has been scaled carefully to retain representation learning. Our work thus has important implications for theoretical understanding of neural networks. In addition, we show that the deep kernel machine objective has more desirable properties and better performance than other choices of objective. Finally, we conjecture that the deep kernel machine objective is unimodal. We give a proof of unimodality for linear kernels, and a number of experiments in the nonlinear case in which all deep kernel machines initializations we tried converged to the same solution.  ( 2 min )
    CAMul: Calibrated and Accurate Multi-view Time-Series Forecasting. (arXiv:2109.07438v2 [cs.LG] UPDATED)
    Probabilistic time-series forecasting enables reliable decision making across many domains. Most forecasting problems have diverse sources of data containing multiple modalities and structures. Leveraging information as well as uncertainty from these data sources for well-calibrated and accurate forecasts is an important challenging problem. Most previous work on multi-modal learning and forecasting simply aggregate intermediate representations from each data view by simple methods of summation or concatenation and do not explicitly model uncertainty for each data-view. We propose a general probabilistic multi-view forecasting framework CAMul, that can learn representations and uncertainty from diverse data sources. It integrates the knowledge and uncertainty from each data view in a dynamic context-specific manner assigning more importance to useful views to model a well-calibrated forecast distribution. We use CAMul for multiple domains with varied sources and modalities and show that CAMul outperforms other state-of-art probabilistic forecasting models by over 25\% in accuracy and calibration.  ( 2 min )
    Post-hoc Calibration of Neural Networks by g-Layers. (arXiv:2006.12807v2 [cs.LG] UPDATED)
    Calibration of neural networks is a critical aspect to consider when incorporating machine learning models in real-world decision-making systems where the confidence of decisions are equally important as the decisions themselves. In recent years, there is a surge of research on neural network calibration and the majority of the works can be categorized into post-hoc calibration methods, defined as methods that learn an additional function to calibrate an already trained base network. In this work, we intend to understand the post-hoc calibration methods from a theoretical point of view. Especially, it is known that minimizing Negative Log-Likelihood (NLL) will lead to a calibrated network on the training set if the global optimum is attained (Bishop, 1994). Nevertheless, it is not clear learning an additional function in a post-hoc manner would lead to calibration in the theoretical sense. To this end, we prove that even though the base network ($f$) does not lead to the global optimum of NLL, by adding additional layers ($g$) and minimizing NLL by optimizing the parameters of $g$ one can obtain a calibrated network $g \circ f$. This not only provides a less stringent condition to obtain a calibrated network but also provides a theoretical justification of post-hoc calibration methods. Our experiments on various image classification benchmarks confirm the theory.  ( 2 min )
    MultiRocket: Multiple pooling operators and transformations for fast and effective time series classification. (arXiv:2102.00457v4 [cs.LG] UPDATED)
    We propose MultiRocket, a fast time series classification (TSC) algorithm that achieves state-of-the-art performance with a tiny fraction of the time and without the complex ensembling structure of many state-of-the-art methods. MultiRocket improves on MiniRocket, one of the fastest TSC algorithms to date, by adding multiple pooling operators and transformations to improve the diversity of the features generated. In addition to processing the raw input series, MultiRocket also applies first order differences to transform the original series. Convolutions are applied to both representations, and four pooling operators are applied to the convolution outputs. When benchmarked using the University of California Riverside TSC benchmark datasets, MultiRocket is significantly more accurate than MiniRocket, and competitive with the best ranked current method in terms of accuracy, HIVE-COTE 2.0, while being orders of magnitude faster.  ( 2 min )
    The Hyperspherical Geometry of Community Detection: Modularity as a Distance. (arXiv:2107.02645v2 [cs.SI] UPDATED)
    We introduce a metric space of clusterings, where clusterings are described by a binary vector indexed by the vertex-pairs. We extend this geometry to a hypersphere and prove that maximizing modularity is equivalent to minimizing the angular distance to some modularity vector over the set of clustering vectors. In that sense, modularity-based community detection methods can be seen as a subclass of a more general class of projection methods, which we define as the community detection methods that adhere to the following two-step procedure: first, mapping the network to a point on the hypersphere; second, projecting this point to the set of clustering vectors. We show that this class of projection methods contains many interesting community detection methods. Many of these new methods cannot be described in terms of null models and resolution parameters, as is customary for modularity-based methods. We provide a new characterization of such methods in terms of meridians and latitudes of the hypersphere. In addition, by relating the modularity resolution parameter to the latitude of the corresponding modularity vector, we obtain a new interpretation of the resolution limit that modularity maximization is known to suffer from.  ( 2 min )
    Signal Processing on Higher-Order Networks: Livin' on the Edge ... and Beyond. (arXiv:2101.05510v4 [cs.SI] UPDATED)
    In this tutorial, we provide a didactic treatment of the emerging topic of signal processing on higher-order networks. Drawing analogies from discrete and graph signal processing, we introduce the building blocks for processing data on simplicial complexes and hypergraphs, two common higher-order network abstractions that can incorporate polyadic relationships. We provide brief introductions to simplicial complexes and hypergraphs, with a special emphasis on the concepts needed for the processing of signals supported on these structures. Specifically, we discuss Fourier analysis, signal denoising, signal interpolation, node embeddings, and nonlinear processing through neural networks, using these two higher-order network models. In the context of simplicial complexes, we specifically focus on signal processing using the Hodge Laplacian matrix, a multi-relational operator that leverages the special structure of simplicial complexes and generalizes desirable properties of the Laplacian matrix in graph signal processing. For hypergraphs, we present both matrix and tensor representations, and discuss the trade-offs in adopting one or the other. We also highlight limitations and potential research avenues, both to inform practitioners and to motivate the contribution of new researchers to the area.  ( 2 min )
    A Regularized Implicit Policy for Offline Reinforcement Learning. (arXiv:2202.09673v1 [stat.ML])
    Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment. The lack of environmental interactions makes the policy training vulnerable to state-action pairs far from the training dataset and prone to missing rewarding actions. For training more effective agents, we propose a framework that supports learning a flexible yet well-regularized fully-implicit policy. We further propose a simple modification to the classical policy-matching methods for regularizing with respect to the dual form of the Jensen--Shannon divergence and the integral probability metrics. We theoretically show the correctness of the policy-matching approach, and the correctness and a good finite-sample property of our modification. An effective instantiation of our framework through the GAN structure is provided, together with techniques to explicitly smooth the state-action mapping for robust generalization beyond the static dataset. Extensive experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.  ( 2 min )
    Inferring Lexicographically-Ordered Rewards from Preferences. (arXiv:2202.10153v1 [cs.LG])
    Modeling the preferences of agents over a set of alternatives is a principal concern in many areas. The dominant approach has been to find a single reward/utility function with the property that alternatives yielding higher rewards are preferred over alternatives yielding lower rewards. However, in many settings, preferences are based on multiple, often competing, objectives; a single reward function is not adequate to represent such preferences. This paper proposes a method for inferring multi-objective reward-based representations of an agent's observed preferences. We model the agent's priorities over different objectives as entering lexicographically, so that objectives with lower priorities matter only when the agent is indifferent with respect to objectives with higher priorities. We offer two example applications in healthcare, one inspired by cancer treatment, the other inspired by organ transplantation, to illustrate how the lexicographically-ordered rewards we learn can provide a better understanding of a decision-maker's preferences and help improve policies when used in reinforcement learning.  ( 2 min )
    Robustness and Accuracy Could Be Reconcilable by (Proper) Definition. (arXiv:2202.10103v1 [cs.LG])
    The trade-off between robustness and accuracy has been widely studied in the adversarial literature. Although still controversial, the prevailing view is that this trade-off is inherent, either empirically or theoretically. Thus, we dig for the origin of this trade-off in adversarial training and find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance -- an overcorrection towards smoothness. Given this, we advocate employing local equivariance to describe the ideal behavior of a robust model, leading to a self-consistent robust error named SCORE. By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty via robust optimization. By simply substituting KL divergence with variants of distance metrics, SCORE can be efficiently minimized. Empirically, our models achieve top-rank performance on RobustBench under AutoAttack. Besides, SCORE provides instructive insights for explaining the overfitting phenomenon and semantic input gradients observed on robust models.  ( 2 min )
    Learning to Control Partially Observed Systems with Finite Memory. (arXiv:2202.09753v1 [cs.LG])
    We consider the reinforcement learning problem for partially observed Markov decision processes (POMDPs) with large or even countably infinite state spaces, where the controller has access to only noisy observations of the underlying controlled Markov chain. We consider a natural actor-critic method that employs a finite internal memory for policy parameterization, and a multi-step temporal difference learning algorithm for policy evaluation. We establish, to the best of our knowledge, the first non-asymptotic global convergence of actor-critic methods for partially observed systems under function approximation. In particular, in addition to the function approximation and statistical errors that also arise in MDPs, we explicitly characterize the error due to the use of finite-state controllers. This additional error is stated in terms of the total variation distance between the traditional belief state in POMDPs and the posterior distribution of the hidden state when using a finite-state controller. Further, we show that this error can be made small in the case of sliding-block controllers by using larger block sizes.  ( 2 min )
    How Good are Low-Rank Approximations in Gaussian Process Regression?. (arXiv:2112.06410v3 [stat.ML] UPDATED)
    We provide guarantees for approximate Gaussian Process (GP) regression resulting from two common low-rank kernel approximations: based on random Fourier features, and based on truncating the kernel's Mercer expansion. In particular, we bound the Kullback-Leibler divergence between an exact GP and one resulting from one of the afore-described low-rank approximations to its kernel, as well as between their corresponding predictive densities, and we also bound the error between predictive mean vectors and between predictive covariance matrices computed using the exact versus using the approximate GP. We provide experiments on both simulated data and standard benchmarks to evaluate the effectiveness of our theoretical bounds.  ( 2 min )
    Generalized Bayesian Additive Regression Trees Models: Beyond Conditional Conjugacy. (arXiv:2202.09924v1 [stat.ML])
    Bayesian additive regression trees have seen increased interest in recent years due to their ability to combine machine learning techniques with principled uncertainty quantification. The Bayesian backfitting algorithm used to fit BART models, however, limits their application to a small class of models for which conditional conjugacy exists. In this article, we greatly expand the domain of applicability of BART to arbitrary \emph{generalized BART} models by introducing a very simple, tuning-parameter-free, reversible jump Markov chain Monte Carlo algorithm. Our algorithm requires only that the user be able to compute the likelihood and (optionally) its gradient and Fisher information. The potential applications are very broad; we consider examples in survival analysis, structured heteroskedastic regression, and gamma shape regression.  ( 2 min )
    PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior. (arXiv:2106.06406v2 [stat.ML] UPDATED)
    Denoising diffusion probabilistic models have been recently proposed to generate high-quality samples by estimating the gradient of the data density. The framework defines the prior noise as a standard Gaussian distribution, whereas the corresponding data distribution may be more complicated than the standard Gaussian distribution, which potentially introduces inefficiency in denoising the prior noise into the data sample because of the discrepancy between the data and the prior. In this paper, we propose PriorGrad to improve the efficiency of the conditional diffusion model for speech synthesis (for example, a vocoder using a mel-spectrogram as the condition) by applying an adaptive prior derived from the data statistics based on the conditional information. We formulate the training and sampling procedures of PriorGrad and demonstrate the advantages of an adaptive prior through a theoretical analysis. Focusing on the speech synthesis domain, we consider the recently proposed diffusion-based speech generative models based on both the spectral and time domains and show that PriorGrad achieves faster convergence and inference with superior performance, leading to an improved perceptual quality and robustness to a smaller network capacity, and thereby demonstrating the efficiency of a data-dependent adaptive prior.  ( 2 min )
    Bad priors, their threats to Bayesian optimization and some remedies via prior learning. (arXiv:2109.08215v2 [cs.LG] UPDATED)
    The performance of deep neural networks can be highly sensitive to the choice of a variety of meta-parameters, such as optimizer parameters and model hyperparameters. Tuning these well, however, often requires extensive and costly experimentation. Bayesian optimization (BO) is a principled approach to solve such expensive hyperparameter tuning problems efficiently. Key to the performance of BO is specifying and refining a distribution over functions, which is used to reason about the optima of the underlying function being optimized. In this work, we consider the scenario where we have data from similar functions that allows us to specify a tighter distribution a priori. Specifically, we focus on the common but potentially costly task of tuning optimizer parameters for training neural networks. Building on the meta BO method from Wang et al. (2018), we develop practical improvements that (a) boost its performance by leveraging tuning results on multiple tasks without requiring observations for the same meta-parameter points across all tasks, and (b) retain its regret bound for a special case of our method. As a result, we provide a coherent BO solution for iterative optimization of continuous optimizer parameters. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.  ( 2 min )
    Adversarially Robust Kernel Smoothing. (arXiv:2102.08474v4 [cs.LG] UPDATED)
    We propose a scalable robust learning algorithm combining kernel smoothing and robust optimization. Our method is motivated by the convex analysis perspective of distributionally robust optimization based on probability metrics, such as the Wasserstein distance and the maximum mean discrepancy. We adapt the integral operator using supremal convolution in convex analysis to form a novel function majorant used for enforcing robustness. Our method is simple in form and applies to general loss functions and machine learning models. Exploiting a connection with optimal transport, we prove theoretical guarantees for certified robustness under distribution shift. Furthermore, we report experiments with general machine learning models, such as deep neural networks, to demonstrate competitive performance with the state-of-the-art certifiable robust learning algorithms based on the Wasserstein distance.  ( 2 min )
    Interacting Contour Stochastic Gradient Langevin Dynamics. (arXiv:2202.09867v1 [stat.ML])
    We propose an interacting contour stochastic gradient Langevin dynamics (ICSGLD) sampler, an embarrassingly parallel multiple-chain contour stochastic gradient Langevin dynamics (CSGLD) sampler with efficient interactions. We show that ICSGLD can be theoretically more efficient than a single-chain CSGLD with an equivalent computational budget. We also present a novel random-field function, which facilitates the estimation of self-adapting parameters in big data and obtains free mode explorations. Empirically, we compare the proposed algorithm with popular benchmark methods for posterior sampling. The numerical results show a great potential of ICSGLD for large-scale uncertainty estimation tasks.  ( 2 min )
    Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression. (arXiv:2202.09889v1 [stat.ML])
    We examine the necessity of interpolation in overparameterized models, that is, when achieving optimal predictive risk in machine learning problems requires (nearly) interpolating the training data. In particular, we consider simple overparameterized linear regression $y = X \theta + w$ with random design $X \in \mathbb{R}^{n \times d}$ under the proportional asymptotics $d/n \to \gamma \in (1, \infty)$. We precisely characterize how prediction (test) error necessarily scales with training error in this setting. An implication of this characterization is that as the label noise variance $\sigma^2 \to 0$, any estimator that incurs at least $\mathsf{c}\sigma^4$ training error for some constant $\mathsf{c}$ is necessarily suboptimal and will suffer growth in excess prediction error at least linear in the training error. Thus, optimal performance requires fitting training data to substantially higher accuracy than the inherent noise floor of the problem.  ( 2 min )
    Polytopic Matrix Factorization: Determinant Maximization Based Criterion and Identifiability. (arXiv:2202.09638v1 [stat.ML])
    We introduce Polytopic Matrix Factorization (PMF) as a novel data decomposition approach. In this new framework, we model input data as unknown linear transformations of some latent vectors drawn from a polytope. In this sense, the article considers a semi-structured data model, in which the input matrix is modeled as the product of a full column rank matrix and a matrix containing samples from a polytope as its column vectors. The choice of polytope reflects the presumed features of the latent components and their mutual relationships. As the factorization criterion, we propose the determinant maximization (Det-Max) for the sample autocorrelation matrix of the latent vectors. We introduce a sufficient condition for identifiability, which requires that the convex hull of the latent vectors contains the maximum volume inscribed ellipsoid of the polytope with a particular tightness constraint. Based on the Det-Max criterion and the proposed identifiability condition, we show that all polytopes that satisfy a particular symmetry restriction qualify for the PMF framework. Having infinitely many polytope choices provides a form of flexibility in characterizing latent vectors. In particular, it is possible to define latent vectors with heterogeneous features, enabling the assignment of attributes such as nonnegativity and sparsity at the subvector level. The article offers examples illustrating the connection between polytope choices and the corresponding feature representations.  ( 2 min )
    Bayes-Optimal Classifiers under Group Fairness. (arXiv:2202.09724v1 [stat.ML])
    Machine learning algorithms are becoming integrated into more and more high-stakes decision-making processes, such as in social welfare issues. Due to the need of mitigating the potentially disparate impacts from algorithmic predictions, many approaches have been proposed in the emerging area of fair machine learning. However, the fundamental problem of characterizing Bayes-optimal classifiers under various group fairness constraints is not well understood as a theoretical benchmark. Based on the classical Neyman-Pearson argument (Neyman and Pearson, 1933; Shao, 2003) for optimal hypothesis testing, this paper provides a general framework for deriving Bayes-optimal classifiers under group fairness. This enables us to propose a group-based thresholding method that can directly control disparity, and more importantly, achieve an optimal fairness-accuracy tradeoff. These advantages are supported by experiments.  ( 2 min )
    Deconstructing Distributions: A Pointwise Framework of Learning. (arXiv:2202.09931v1 [cs.LG])
    In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021)  ( 2 min )
    On Optimal Early Stopping: Over-informative versus Under-informative Parametrization. (arXiv:2202.09885v1 [cs.LG])
    Early stopping is a simple and widely used method to prevent over-training neural networks. We develop theoretical results to reveal the relationship between the optimal early stopping time and model dimension as well as sample size of the dataset for certain linear models. Our results demonstrate two very different behaviors when the model dimension exceeds the number of features versus the opposite scenario. While most previous works on linear models focus on the latter setting, we observe that the dimension of the model often exceeds the number of features arising from data in common deep learning tasks and propose a model to study this setting. We demonstrate experimentally that our theoretical results on optimal early stopping time corresponds to the training process of deep neural networks.  ( 2 min )
    RELAX: Representation Learning Explainability. (arXiv:2112.10161v2 [stat.ML] UPDATED)
    Despite the significant improvements that representation learning via self-supervision has led to when learning from unlabeled data, no methods exist that explain what influences the learned representation. We address this need through our proposed approach, RELAX, which is the first approach for attribution-based explanations of representations. Our approach can also model the uncertainty in its explanations, which is essential to produce trustworthy explanations. RELAX explains representations by measuring similarities in the representation space between an input and masked out versions of itself, providing intuitive explanations and significantly outperforming the gradient-based baseline. We provide theoretical interpretations of RELAX and conduct a novel analysis of feature extractors trained using supervised and unsupervised learning, providing insights into different learning strategies. Finally, we illustrate the usability of RELAX in multi-view clustering and highlight that incorporating uncertainty can be essential for providing low-complexity explanations, taking a crucial step towards explaining representations.  ( 2 min )
    Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems. (arXiv:2201.01680v2 [cs.LG] UPDATED)
    This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems. We consider smoothly parametrized instances and provide an understanding of when logarithmic regret is impossible which is both instance specific and flexible enough to take problem structure into account. This understanding relies on two key notions: That of local-uninformativeness; when the optimal policy does not provide sufficient excitation for identification of the optimal policy, and yields a degenerate Fisher information matrix; and that of information-regret-boundedness, when the small eigenvalues of a policy-dependent information matrix are boundable in terms of the regret of that policy. Combined with a reduction to Bayesian estimation and application of Van Trees' inequality, these two conditions are sufficient for proving regret bounds on order of magnitude $\sqrt{T}$ in the time horizon, $T$. This method yields lower bounds that exhibit tight dimensional dependencies and scale naturally with control-theoretic problem constants. For instance, we are able to prove that systems operating near marginal stability are fundamentally hard to learn to control. We further show that large classes of systems satisfy these conditions, among them any state-feedback system with both $A$- and $B$-matrices unknown. Most importantly, we also establish that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems. Finally, we turn to two simple examples which demonstrate that our lower bound captures classical control-theoretic intuition: our lower bounds diverge for systems operating near marginal stability or with large filter gain -- these can be arbitrarily hard to (learn to) control.  ( 2 min )
    Efficient Manifold and Subspace Approximations with Spherelets. (arXiv:1706.08263v4 [stat.ML] UPDATED)
    In statistical dimensionality reduction, it is common to rely on the assumption that high dimensional data tend to concentrate near a lower dimensional manifold. There is a rich literature on approximating the unknown manifold, and on exploiting such approximations in clustering, data compression, and prediction. Most of the literature relies on linear or locally linear approximations. In this article, we propose a simple and general alternative, which instead uses spheres, an approach we refer to as spherelets. We develop spherical principal components analysis (SPCA), and provide theory on the convergence rate for global and local SPCA, while showing that spherelets can provide lower covering numbers and MSEs for many manifolds. Results relative to state-of-the-art competitors show gains in ability to accurately approximate manifolds with fewer components. Unlike most competitors, which simply output lower-dimensional features, our approach projects data onto the estimated manifold to produce fitted values that can be used for model assessment and cross validation. The methods are illustrated with applications to multiple data sets.  ( 2 min )
    Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems. (arXiv:2007.03481v4 [cs.LG] UPDATED)
    This paper presents an inverse reinforcement learning~(IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality wrt the observed actions. Our IRL algorithm identifies optimality and then constructs set valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.  ( 2 min )
    Image Response Regression via Deep Neural Networks. (arXiv:2006.09911v3 [stat.ML] UPDATED)
    Delineating the associations between images and a vector of covariates is of central interest in medical imaging studies. To tackle this problem of image response regression, we propose a novel nonparametric approach in the framework of spatially varying coefficient models, where the spatially varying functions are estimated through deep neural networks. Compared to existing solutions, the proposed method explicitly accounts for spatial smoothness and subject heterogeneity, has straightforward interpretations, and is highly flexible and accurate in capturing complex association patterns. A key idea in our approach is to treat the image voxels as the effective samples, which not only alleviates the limited sample size issue that haunts the majority of medical imaging studies, but also leads to more robust and reproducible results. Focusing on a broad family of piecewise smooth functions, we establish the estimation and selection consistency, and derive the asymptotic error bounds. We demonstrate the efficacy of the method through intensive simulations, and further illustrate its advantages with analyses of two functional magnetic resonance imaging datasets.  ( 2 min )
    On out-of-distribution detection with Bayesian neural networks. (arXiv:2110.06020v2 [cs.LG] UPDATED)
    The question whether inputs are valid for the problem a neural network is trying to solve has sparked interest in out-of-distribution (OOD) detection. It is widely assumed that Bayesian neural networks (BNNs) are well suited for this task, as the endowed epistemic uncertainty should lead to disagreement in predictions on outliers. In this paper, we question this assumption and show that proper Bayesian inference with function space priors induced by neural networks does not necessarily lead to good OOD detection. To circumvent the use of approximate inference, we start by studying the infinite-width case, where Bayesian inference can be exact due to the correspondence with Gaussian processes. Strikingly, the kernels derived from common architectural choices lead to function space priors which induce predictive uncertainties that do not reflect the underlying input data distribution and are therefore unsuited for OOD detection. Importantly, we find the OOD behavior in this limiting case to be consistent with the corresponding finite-width case. To overcome this limitation, useful function space properties can also be encoded in the prior in weight space, however, this can currently only be applied to a specified subset of the domain and thus does not inherently extend to OOD data. Finally, we argue that a trade-off between generalization and OOD capabilities might render the application of BNNs for OOD detection undesirable in practice. Overall, our study discloses fundamental problems when naively using BNNs for OOD detection and opens interesting avenues for future research.  ( 2 min )
    Efficacy of regularized multi-task learning based on SVM models. (arXiv:1805.12507v2 [stat.ML] UPDATED)
    This paper investigates the efficacy of a regularized multi-task learning (MTL) framework based on SVM (M-SVM) to answer whether MTL always provides reliable results and how MTL outperforms independent learning. We first find that M-SVM is Bayes risk consistent in the limit of large sample size. This implies that despite the task dissimilarities, M-SVM always produces a reliable decision rule for each task in terms of misclassification error when the data size is large enough. Furthermore, we find that the task-interaction vanishes as the data size goes to infinity, and the convergence rates of M-SVM and its single-task counterpart have the same upper bound. The former suggests that M-SVM cannot improve the limit classifier's performance; based on the latter, we conjecture that the optimal convergence rate is not improved when the task number is fixed. As a novel insight of MTL, our theoretical and experimental results achieved an excellent agreement that the benefit of the MTL methods lies in the improvement of the pre-convergence-rate factor (PCR, to be denoted in Section III) rather than the convergence rate. Moreover, this improvement of PCR factors is more significant when the data size is small.  ( 2 min )
    Explicit Group Sparse Projection with Applications to Deep Learning and NMF. (arXiv:1912.03896v3 [cs.LG] UPDATED)
    We design a new sparse projection method for a set of vectors that guarantees a desired average sparsity level measured leveraging the popular Hoyer measure (an affine function of the ratio of the $\ell_1$ and $\ell_2$ norms). Existing approaches either project each vector individually or require the use of a regularization parameter which implicitly maps to the average $\ell_0$-measure of sparsity. Instead, in our approach we set the sparsity level for the whole set explicitly and simultaneously project a group of vectors with the sparsity level of each vector tuned automatically. We show that the computational complexity of our projection operator is linear in the size of the problem. Additionally, we propose a generalization of this projection by replacing the $\ell_1$ norm by its weighted version. We showcase the efficacy of our approach in both supervised and unsupervised learning tasks on image datasets including CIFAR10 and ImageNet. In deep neural network pruning, the sparse models produced by our method on ResNet50 have significantly higher accuracies at corresponding sparsity values compared to existing competitors. In nonnegative matrix factorization, our approach yields competitive reconstruction errors against state-of-the-art algorithms.  ( 2 min )
    Nonseparable Symplectic Neural Networks. (arXiv:2010.12636v3 [cs.LG] UPDATED)
    Predicting the behaviors of Hamiltonian systems has been drawing increasing attention in scientific machine learning. However, the vast majority of the literature was focused on predicting separable Hamiltonian systems with their kinematic and potential energy terms being explicitly decoupled while building data-driven paradigms to predict nonseparable Hamiltonian systems that are ubiquitous in fluid dynamics and quantum mechanics were rarely explored. The main computational challenge lies in the effective embedding of symplectic priors to describe the inherently coupled evolution of position and momentum, which typically exhibits intricate dynamics. To solve the problem, we propose a novel neural network architecture, Nonseparable Symplectic Neural Networks (NSSNNs), to uncover and embed the symplectic structure of a nonseparable Hamiltonian system from limited observation data. The enabling mechanics of our approach is an augmented symplectic time integrator to decouple the position and momentum energy terms and facilitate their evolution. We demonstrated the efficacy and versatility of our method by predicting a wide range of Hamiltonian systems, both separable and nonseparable, including chaotic vortical flows. We showed the unique computational merits of our approach to yield long-term, accurate, and robust predictions for large-scale Hamiltonian systems by rigorously enforcing symplectomorphism.  ( 2 min )
    ABO3 Perovskites' Formability Prediction and Crystal Structure Classification using Machine Learning. (arXiv:2202.10125v1 [cond-mat.mtrl-sci])
    Renewable energy sources are of great interest to combat global warming, yet promising sources like photovoltaic (PV) cells are not efficient and cheap enough to act as an alternative to traditional energy sources. Perovskite has high potential as a PV material but engineering the right material for a specific application is often a lengthy process. In this paper, ABO3 type perovskites' formability is predicted and its crystal structure is classified using machine learning with high accuracy, which provides a fast screening process. Although the study was done with solar-cell application in mind, the prediction framework is generic enough to be used for other purposes. Formability of perovskite is predicted and its crystal structure is classified with an accuracy of 98.57% and 90.53% respectively using Random Forest after 5-fold cross-validation. Our machine learning model may aid in the accelerated development of a desired perovskite structure by providing a quick mechanism to get insight into the material's properties in advance.  ( 2 min )
    Stochastic Modeling of Inhomogeneities in the Aortic Wall and Uncertainty Quantification using a Bayesian Encoder-Decoder Surrogate. (arXiv:2202.10244v1 [stat.ML])
    Inhomogeneities in the aortic wall can lead to localized stress accumulations, possibly initiating dissection. In many cases, a dissection results from pathological changes such as fragmentation or loss of elastic fibers. But it has been shown that even the healthy aortic wall has an inherent heterogeneous microstructure. Some parts of the aorta are particularly susceptible to the development of inhomogeneities due to pathological changes, however, the distribution in the aortic wall and the spatial extent, such as size, shape, and type, are difficult to predict. Motivated by this observation, we describe the heterogeneous distribution of elastic fiber degradation in the dissected aortic wall using a stochastic constitutive model. For this purpose, random field realizations, which model the stochastic distribution of degraded elastic fibers, are generated over a non-equidistant grid. The random field then serves as input for a uni-axial extension test of the pathological aortic wall, solved with the finite-element (FE) method. To include the microstructure of the dissected aortic wall, a constitutive model developed in a previous study is applied, which also includes an approach to model the degradation of inter-lamellar elastic fibers. Then to assess the uncertainty in the output stress distribution due to this stochastic constitutive model, a convolutional neural network, specifically a Bayesian encoder-decoder, was used as a surrogate model that maps the random input fields to the output stress distribution obtained from the FE analysis. The results show that the neural network is able to predict the stress distribution of the FE analysis while significantly reducing the computational time. In addition, it provides the probability for exceeding critical stresses within the aortic wall, which could allow for the prediction of delamination or fatal rupture.  ( 2 min )
    Interactive Visual Pattern Search on Graph Data via Graph Representation Learning. (arXiv:2202.09459v1 [cs.LG])
    Graphs are a ubiquitous data structure to model processes and relations in a wide range of domains. Examples include control-flow graphs in programs and semantic scene graphs in images. Identifying subgraph patterns in graphs is an important approach to understanding their structural properties. We propose a visual analytics system GraphQ to support human-in-the-loop, example-based, subgraph pattern search in a database containing many individual graphs. To support fast, interactive queries, we use graph neural networks (GNNs) to encode a graph as fixed-length latent vector representation, and perform subgraph matching in the latent space. Due to the complexity of the problem, it is still difficult to obtain accurate one-to-one node correspondences in the matching results that are crucial for visualization and interpretation. We, therefore, propose a novel GNN for node-alignment called NeuroAlign, to facilitate easy validation and interpretation of the query results. GraphQ provides a visual query interface with a query editor and a multi-scale visualization of the results, as well as a user feedback mechanism for refining the results with additional constraints. We demonstrate GraphQ through two example usage scenarios: analyzing reusable subroutines in program workflows and semantic scene graph search in images. Quantitative experiments show that NeuroAlign achieves 19-29% improvement in node-alignment accuracy compared to baseline GNN and provides up to 100x speedup compared to combinatorial algorithms. Our qualitative study with domain experts confirms the effectiveness for both usage scenarios.  ( 2 min )
    Structured second-order methods via natural gradient descent. (arXiv:2107.10884v3 [stat.ML] UPDATED)
    In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach to design new algorithms in many settings such as gradient-free, adaptive-gradient, and second-order methods. Our structured methods not only enjoy a structural invariance but also admit a simple expression. Finally, we test the efficiency of our proposed methods on both deterministic non-convex problems and deep learning problems.  ( 2 min )
    Selective Credit Assignment. (arXiv:2202.09699v1 [cs.LG])
    Efficient credit assignment is essential for reinforcement learning algorithms in both prediction and control settings. We describe a unified view on temporal-difference algorithms for selective credit assignment. These selective algorithms apply weightings to quantify the contribution of learning updates. We present insights into applying weightings to value-based learning and planning algorithms, and describe their role in mediating the backward credit distribution in prediction and control. Within this space, we identify some existing online learning algorithms that can assign credit selectively as special cases, as well as add new algorithms that assign credit backward in time counterfactually, allowing credit to be assigned off-trajectory and off-policy.  ( 2 min )
    Gradient Estimation with Discrete Stein Operators. (arXiv:2202.09497v1 [stat.ML])
    Gradient estimation -- approximating the gradient of an expectation with respect to the parameters of a distribution -- is central to the solution of many machine learning problems. However, when the distribution is discrete, most common gradient estimators suffer from excessive variance. To improve the quality of gradient estimation, we introduce a variance reduction technique based on Stein operators for discrete distributions. We then use this technique to build flexible control variates for the REINFORCE leave-one-out estimator. Our control variates can be adapted online to minimize the variance and do not require extra evaluations of the target function. In benchmark generative modeling tasks such as training binary variational autoencoders, our gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.  ( 2 min )
    The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication. (arXiv:2202.09653v1 [cs.LG])
    We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$ where $\Delta$ is the gap between the $m$-th and $m+1$-st best arms. Such guarantees were recently achieved in a model allowing the players to implicitly communicate through intentional collisions. We show that with no communication at all, such guarantees are, surprisingly, not achievable. In fact, obtaining the optimal $\tilde{O}(1/\Delta)$ regret for some regimes of $\Delta$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author and enjoys the same strong no-collision property, while our lower bound is based on a topological obstruction and holds even under full information.  ( 2 min )
    A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes. (arXiv:2111.06784v2 [cs.LG] UPDATED)
    We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail.  ( 2 min )
    Pyramid Convolutional RNN for MRI Image Reconstruction. (arXiv:1912.00543v6 [eess.IV] UPDATED)
    Fast and accurate MRI image reconstruction from undersampled data is crucial in clinical practice. Deep learning based reconstruction methods have shown promising advances in recent years. However, recovering fine details from undersampled data is still challenging. In this paper, we introduce a novel deep learning based method, Pyramid Convolutional RNN (PC-RNN), to reconstruct images from multiple scales. Based on the formulation of MRI reconstruction as an inverse problem, we design the PC-RNN model with three convolutional RNN (ConvRNN) modules to iteratively learn the features in multiple scales. Each ConvRNN module reconstructs images at different scales and the reconstructed images are combined by a final CNN module in a pyramid fashion. The multi-scale ConvRNN modules learn a coarse-to-fine image reconstruction. Unlike other common reconstruction methods for parallel imaging, PC-RNN does not employ coil sensitive maps for multi-coil data and directly model the multiple coils as multi-channel inputs. The coil compression technique is applied to standardize data with various coil numbers, leading to more efficient training. We evaluate our model on the fastMRI knee and brain datasets and the results show that the proposed model outperforms other methods and can recover more details. The proposed method is one of the winner solutions in the 2019 fastMRI competition.  ( 2 min )
    Non-asymptotic analysis and inference for an outlyingness induced winsorized mean. (arXiv:2105.02337v2 [stat.ML] UPDATED)
    Robust estimation of a mean vector, a topic regarded as obsolete in the traditional robust statistics community, has recently surged in machine learning literature in the last decade. The latest focus is on the sub-Gaussian performance and computability of the estimators in a non-asymptotic setting. Numerous traditional robust estimators are computationally intractable, which partly contributes to the renewal of the interest in the robust mean estimation. Robust centrality estimators, however, include the trimmed mean and the sample median. The latter has the best robustness but suffers a low-efficiency drawback. Trimmed mean and median of means, %as robust alternatives to the sample mean, and achieving sub-Gaussian performance have been proposed and studied in the literature. This article investigates the robustness of leading sub-Gaussian estimators of mean and reveals that none of them can resist greater than $25\%$ contamination in data and consequently introduces an outlyingness induced winsorized mean which has the best possible robustness (can resist up to $50\%$ contamination without breakdown) meanwhile achieving high efficiency. Furthermore, it has a sub-Gaussian performance for uncontaminated samples and a bounded estimation error for contaminated samples at a given confidence level in a finite sample setting. It can be computed in linear time.  ( 2 min )
    Generalized Bayesian Upper Confidence Bound with Approximate Inference for Bandit Problems. (arXiv:2201.12955v2 [cs.LG] UPDATED)
    Bayesian bandit algorithms with approximate inference have been widely used in practice with superior performance. Yet, few studies regarding the fundamental understanding of their performances are available. In this paper, we propose a Bayesian bandit algorithm, which we call Generalized Bayesian Upper Confidence Bound (GBUCB), for bandit problems in the presence of approximate inference. Our theoretical analysis demonstrates that in Bernoulli multi-armed bandit, GBUCB can achieve $O(\sqrt{T}(\log T)^c)$ frequentist regret if the inference error measured by symmetrized Kullback-Leibler divergence is controllable. This analysis relies on a novel sensitivity analysis for quantile shifts with respect to inference errors. To our best knowledge, our work provides the first theoretical regret bound that is better than $o(T)$ in the setting of approximate inference. Our experimental evaluations on multiple approximate inference settings corroborate our theory, showing that our GBUCB is consistently superior to BUCB and Thompson sampling.  ( 2 min )
    Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation From Undersampled Data. (arXiv:1610.00199v3 [cs.NA] UPDATED)
    Subspace learning and matrix factorization problems have great many applications in science and engineering, and efficient algorithms are critical as dataset sizes continue to grow. Many relevant problem formulations are non-convex, and in a variety of contexts it has been observed that solving the non-convex problem directly is not only efficient but reliably accurate. We discuss convergence theory for a particular method: first order incremental gradient descent constrained to the Grassmannian. The output of the algorithm is an orthonormal basis for a $d$-dimensional subspace spanned by an input streaming data matrix. We study two sampling cases: where each data vector of the streaming matrix is fully sampled, or where it is undersampled by a sampling matrix $A_t\in \mathbb{R}^{m\times n}$ with $m\ll n$. Our results cover two cases, where $A_t$ is Gaussian or a subset of rows of the identity matrix. We propose an adaptive stepsize scheme that depends only on the sampled data and algorithm outputs. We prove that with fully sampled data, the stepsize scheme maximizes the improvement of our convergence metric at each iteration, and this method converges from any random initialization to the true subspace, despite the non-convex formulation and orthogonality constraints. For the case of undersampled data, we establish monotonic expected improvement on the defined convergence metric for each iteration with high probability.  ( 2 min )
    Trying to Outrun Causality in Machine Learning: Limitations of Model Explainabilty Techniques for Identifying Predictive Variables. (arXiv:2202.09875v1 [stat.ML])
    Machine Learning explainability techniques have been proposed as a means of `explaining' or interrogating a model in order to understand why a particular decision or prediction has been made. Such an ability is especially important at a time when machine learning is being used to automate decision processes which concern sensitive factors and legal outcomes. Indeed, it is even a requirement according to EU law. Furthermore, researchers concerned with imposing overly restrictive functional form (e.g. as would be the case in a linear regression) may be motivated to use machine learning algorithms in conjunction with explainability techniques, as part of exploratory research, with the goal of identifying important variables which are associated with an outcome of interest. For example, epidemiologists might be interested in identifying 'risk factors' - i.e., factors which affect recovery from disease - by using random forests and assessing variable relevance using importance measures. However, and as we aim to demonstrate, machine learning algorithms are not as flexible as they might seem, and are instead incredibly sensitive to the underling causal structure in the data. The consequences of this are that predictors which are, in fact, critical to a causal system and highly correlated with the outcome, may nonetheless be deemed by explainability techniques to be unrelated/unimportant/unpredictive of the outcome. Rather than this being a limitation of explainability techniques per se, it is rather a consequence of the mathematical implications of regressions, and the interaction of these implications with the associated conditional independencies of the underlying causal structure. We provide some alternative recommendations for researchers wanting to explore the data for important variables.  ( 2 min )
    Accurate Prediction and Uncertainty Estimation using Decoupled Prediction Interval Networks. (arXiv:2202.09664v1 [cs.LG])
    We propose a network architecture capable of reliably estimating uncertainty of regression based predictions without sacrificing accuracy. The current state-of-the-art uncertainty algorithms either fall short of achieving prediction accuracy comparable to the mean square error optimization or underestimate the variance of network predictions. We propose a decoupled network architecture that is capable of accomplishing both at the same time. We achieve this by breaking down the learning of prediction and prediction interval (PI) estimations into a two-stage training process. We use a custom loss function for learning a PI range around optimized mean estimation with a desired coverage of a proportion of the target labels within the PI range. We compare the proposed method with current state-of-the-art uncertainty quantification algorithms on synthetic datasets and UCI benchmarks, reducing the error in the predictions by 23 to 34% while maintaining 95% Prediction Interval Coverage Probability (PICP) for 7 out of 9 UCI benchmark datasets. We also examine the quality of our predictive uncertainty by evaluating on Active Learning and demonstrating 17 to 36% error reduction on UCI benchmarks.  ( 2 min )
    Generalized Optimistic Methods for Convex-Concave Saddle Point Problems. (arXiv:2202.09674v1 [math.OC])
    The optimistic gradient method has seen increasing popularity as an efficient first-order method for solving convex-concave saddle point problems. To analyze its iteration complexity, a recent work [arXiv:1901.08511] proposed an interesting perspective that interprets the optimistic gradient method as an approximation to the proximal point method. In this paper, we follow this approach and distill the underlying idea of optimism to propose a generalized optimistic method, which encompasses the optimistic gradient method as a special case. Our general framework can handle constrained saddle point problems with composite objective functions and can work with arbitrary norms with compatible Bregman distances. Moreover, we also develop an adaptive line search scheme to select the stepsizes without knowledge of the smoothness coefficients. We instantiate our method with first-order, second-order and higher-order oracles and give sharp global iteration complexity bounds. When the objective function is convex-concave, we show that the averaged iterates of our $p$-th-order method ($p\geq 1$) converge at a rate of $\mathcal{O}(1/N^\frac{p+1}{2})$. When the objective function is further strongly-convex-strongly-concave, we prove a complexity bound of $\mathcal{O}(\frac{L_1}{\mu}\log\frac{1}{\epsilon})$ for our first-order method and a bound of $\mathcal{O}((L_p D^\frac{p-1}{2}/\mu)^{\frac{2}{p+1}}+\log\log\frac{1}{\epsilon})$ for our $p$-th-order method ($p\geq 2$) respectively, where $L_p$ ($p\geq 1$) is the Lipschitz constant of the $p$-th-order derivative, $\mu$ is the strongly-convex parameter, and $D$ is the initial Bregman distance to the saddle point. Moreover, our line search scheme provably only requires an almost constant number of calls to a subproblem solver per iteration on average, making our first-order and second-order methods particularly amenable to implementation.  ( 2 min )
    Multi-task Representation Learning with Stochastic Linear Bandits. (arXiv:2202.10066v1 [stat.ML])
    We study the problem of transfer-learning in the setting of stochastic linear bandit tasks. We consider that a low dimensional linear representation is shared across the tasks, and study the benefit of learning this representation in the multi-task learning setting. Following recent results to design stochastic bandit policies, we propose an efficient greedy policy based on trace norm regularization. It implicitly learns a low dimensional representation by encouraging the matrix formed by the task regression vectors to be of low rank. Unlike previous work in the literature, our policy does not need to know the rank of the underlying matrix. We derive an upper bound on the multi-task regret of our policy, which is, up to logarithmic factors, of order $\sqrt{NdT(T+d)r}$, where $T$ is the number of tasks, $r$ the rank, $d$ the number of variables and $N$ the number of rounds per task. We show the benefit of our strategy compared to the baseline $Td\sqrt{N}$ obtained by solving each task independently. We also provide a lower bound to the multi-task regret. Finally, we corroborate our theoretical findings with preliminary experiments on synthetic data.  ( 2 min )
    Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning. (arXiv:2202.09667v1 [cs.LG])
    Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions, which is crucial in applications where experimentation is necessarily limited. OPE/L is nonetheless sensitive to discrepancies between the data-generating environment and that where policies are deployed. Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the proposal relies on inverse-propensity weighting, whose regret rates may deteriorate if propensities are estimated and whose variance is suboptimal even if not. For vanilla OPE/L, this is solved by doubly robust (DR) methods, but they do not naturally extend to the more complex DROPE/L, which involves a worst-case expectation. In this paper, we propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets. For evaluation, we propose Localized Doubly Robust DROPE (LDR$^2$OPE) and prove its semiparametric efficiency under weak product rates conditions. Notably, thanks to a localization technique, LDR$^2$OPE only requires fitting a small number of regressions, just like DR methods for vanilla OPE. For learning, we propose Continuum Doubly Robust DROPL (CDR$^2$OPL) and show that, under a product rate condition involving a continuum of regressions, it enjoys a fast regret rate of $\mathcal{O}(N^{-1/2})$ even when unknown propensities are nonparametrically estimated. We further extend our results to general $f$-divergence uncertainty sets. We illustrate the advantage of our algorithms in simulations.  ( 2 min )
    Truncated Diffusion Probabilistic Models. (arXiv:2202.09671v1 [stat.ML])
    Employing a forward Markov diffusion chain to gradually map the data to a noise distribution, diffusion probabilistic models learn how to generate the data by inferring a reverse Markov diffusion chain to invert the forward diffusion process. To achieve competitive data generation performance, they demand a long diffusion chain that makes them computationally intensive in not only training but also generation. To significantly improve the computation efficiency, we propose to truncate the forward diffusion chain by abolishing the requirement of diffusing the data to random noise. Consequently, we start the inverse diffusion chain from an implicit generative distribution, rather than random noise, and learn its parameters by matching it to the distribution of the data corrupted by the truncated forward diffusion chain. Experimental results show our truncated diffusion probabilistic models provide consistent improvements over the non-truncated ones in terms of the generation performance and the number of required inverse diffusion steps.  ( 2 min )

  • Open

    Fuzzy Bootstrap Matching
    ABSTRACT This paper discusses techniques for merging data files where no key field exists between the files. The paper will illustrate an approach to resolve two issues that are common to most fuzzy matching techniques: 1) how to weight proxy identifier fields, and 2) how to measure the Type One and Type Two errors of… Read More »Fuzzy Bootstrap Matching The post Fuzzy Bootstrap Matching appeared first on Data Science Central.  ( 7 min )
    Data Observability Goes Far Beyond Data Quality Monitoring and Alerts
    There’s no question that bad data hurts the bottom line. Bad customer data costs companies six percent of their total sales, according to a UK Royal Mail survey. The UK Government’s Data Quality Hub estimates organizations spend between 10% and 30% of their revenue tackling data quality. For multi-billion-dollar companies, that can easily be hundreds of millions of dollars… Read More »Data Observability Goes Far Beyond Data Quality Monitoring and Alerts The post Data Observability Goes Far Beyond Data Quality Monitoring and Alerts appeared first on Data Science Central.  ( 7 min )
    DeFi platforms: What dumb data and dumb code have in common
    Quite a few decentralized autonomous organizations (DAOs) have crashed and burned since 2015 when Ethereum was launched and “smart” contracts (self-executing agreements in code devised by crypto lawyers) were born. If you’d like a detailed look at the largest DAO crash and burn to date, take a look at Laura Shin’s February 22 piece in… Read More »DeFi platforms: What dumb data and dumb code have in common The post DeFi platforms: What dumb data and dumb code have in common appeared first on Data Science Central.  ( 6 min )
    An Overview of the Big Data Engineer
    Machine Learning Engineers, Data Scientists, and Big Data Engineers rank among the top emerging jobs on LinkedIn. Forbes Data Engineering Data engineering is the process of developing and building systems for collecting, storing, and analyzing data. It is a vast field with several applications in various industries. Firms have collected huge amounts of data, and… Read More »An Overview of the Big Data Engineer The post An Overview of the Big Data Engineer appeared first on Data Science Central.  ( 4 min )
    Abundance Mentality is Key to Exploiting the Economics of Data
    Why do data silos, and now analytic silos, continue to exist?  It can’t be due to technical issues.  Data silos appeared in the 1990s when we were trying to make Relational Data Base Management Systems (RDMBS) – that were architected for single-record transaction processing – perform massive table scans to identify the trends, patterns, and… Read More »Abundance Mentality is Key to Exploiting the Economics of Data The post Abundance Mentality is Key to Exploiting the Economics of Data appeared first on Data Science Central.  ( 7 min )
    How to Ensure Data Quality and Integrity?
    Data has become an organization’s most valuable asset nowadays. Not that all data is valuable, the only data that can be trusted is. Working with untrustworthy data can easily lead to incorrect insights, skewed analysis, and poor decisions. The terms data quality and data integrity are used to describe the state of data. The two… Read More »How to Ensure Data Quality and Integrity? The post How to Ensure Data Quality and Integrity? appeared first on Data Science Central.  ( 4 min )
    IoT in Construction — Construction Technology’s Next Frontier
    IoT in construction comprises the use of internet-connected sensors that are used by laborers or placed around job sites. IoT in construction helps to collect various types of data about activity, conditions and performances on the construction site and send these details to a central dashboard where data is analyzed to help inform decisions. Traditionally,… Read More »IoT in Construction — Construction Technology’s Next Frontier The post IoT in Construction — Construction Technology’s Next Frontier appeared first on Data Science Central.  ( 2 min )
    A Framework for Understanding Transformation–Part III
    In the previous article, I discussed three models of your company that should inform and guide your efforts to transform:  the four-level Logical Model, the Enterprise Architecture model and the Digital Business Architecture model.  In this article, I will apply them to designing and planning for your transformation. Models and the Transformation Process Transformation must… Read More »A Framework for Understanding Transformation–Part III The post A Framework for Understanding Transformation–Part III appeared first on Data Science Central.  ( 6 min )
    DSC Weekly Digest 22 February 2022: Graphology
    In the last couple of months, I’ve been noticing a gradual shift in the kind of articles that we receive at Data Science Central. We still get a fair amount of data science content, but increasingly (and admittedly with a bit of encouragement) we’re seeing more articles centered around graphs and semantics. I don’t believe… Read More »DSC Weekly Digest 22 February 2022: Graphology The post DSC Weekly Digest 22 February 2022: Graphology appeared first on Data Science Central.  ( 4 min )
    Scene Graphs and Semantics
    It is nearly certain that, if you have ever played a 3D video game, watched a CGI-effects-laden movie, or seen increasingly hyperrealistic imagery, you have encountered a scene graph without realizing it. Scene graphs are pervasive in everything from media to medicine, from augmented reality to industrial digital twins, and they are increasingly playing an… Read More »Scene Graphs and Semantics The post Scene Graphs and Semantics appeared first on Data Science Central.  ( 8 min )
    AWS Cloud Security: Best Practices
    Companies today need to be nimble and ready in the face of constantly evolving technology and changing consumer preferences. To achieve this, organizations are switching to Amazon Web Services (AWS). It enables companies to rapidly deploy and scale technology that meets the growing (or shrinking) demand without having to invest in expensive IT infrastructure. This… Read More »AWS Cloud Security: Best Practices The post AWS Cloud Security: Best Practices appeared first on Data Science Central.  ( 3 min )
    Top Best Practices to Keep in Mind for Azure Cloud Migration
    As the demands of the modern market continue to evolve, more and more companies are starting to realize that their current infrastructure is not suited to keep up with the market and its requirements. And cloud management refers to the exercise of control over public, private, or hybrid cloud infrastructure with the right resources and… Read More »Top Best Practices to Keep in Mind for Azure Cloud Migration The post Top Best Practices to Keep in Mind for Azure Cloud Migration appeared first on Data Science Central.  ( 3 min )
    The One Algorithm Change That Could Impact the World
    We worry about algorithms ruling our world Some of these concerns are genuine Others unfounded But I believe that as of now, in the absence of AGI, any algorithm changes are specific and not exhaustive/global That gives some comfort But is there an algorithm modification that would affect all / most of the world today?… Read More »The One Algorithm Change That Could Impact the World The post The One Algorithm Change That Could Impact the World appeared first on Data Science Central.  ( 3 min )
  • Open

    [D] List of over 150 Biases (Belief, decision-making & behavioral, Social, Memory)
    An important issue for people developing ML-Models: ​ These biases affect belief formation, reasoning processes, business and economic decisions, and human behavior in general. ​ I've compiled a list (pdf) of over 150 biases (mainly from Wikipedia). Maybe this is useful for some. ​ The pdf can be downloaded for free here: A List of over 150 Biases (Belief, decision-making & behavioral, Social, Memory) (Medium) submitted by /u/Philo167 [link] [comments]  ( 1 min )
    [D] Any advice for how to go about designing CNN architecture?
    When should you add another convolutional layer? When should you add another pooling layer? How should you choose the kernel, stride size, padding, or dilation of convolutional layers? Or the size of averaging windows? When should you use max vs average pooling? When should you use global pooling? Is it ever useful to put dense layers in the middle of the network, instead of only as the final layer? With neural nets and especially CNNs there are way more possible configurations than with other ML models and I don't know how to choose or optimize them. I've learned the math and theory behind how CNNs work and I thought that would help, but now that I'm trying real world problems I find that I still don't know how to make effective networks. Can anyone can give any tips or resources you find helpful? submitted by /u/woodenlathe [link] [comments]  ( 1 min )
    [P] Beware of false (FB-)Prophets: Introducing the fastest implementation of auto ARIMA [ever].
    ​ https://preview.redd.it/bsri5irizfj81.png?width=1244&format=png&auto=webp&s=fd06dd5311bc06f2b417b9a7582e4d832c4fbd3d We are releasing the fastest version of auto ARIMA ever made in Python. It is a lot faster and more accurate than Facebook's prophet and pmdarima packages. As you know, Facebook's prophet is highly inaccurate and is consistently beaten by vanilla ARIMA, for which we get rewarded with a desperately slow fitting time. See MIT's worst technology of 2021 and the Zillow tragedy. The problem with the classic alternatives like pmdarima in Python is that it will never scale due to its language origin. This problem gets notably worse when fitting seasonal series. Inspired by this, we translated Hyndman's auto.arima code from R and compiled it using the numba library. The result is faster than the original implementation and more accurate than prophet . Please check it out and give us a star if you like it https://github.com/Nixtla/statsforecast. ​ Computational Efficiency Comparison ​ Performance Comparison, nixtla is our auto ARIMA submitted by /u/fedegarzar [link] [comments]  ( 4 min )
    [D] How much computing power will we need using the LIDC/IDRI database for various MC algorithms?
    We are conducting our very first research question as part of our university program. We have gotten down to that we want to use the LIDC/IDRI database comprising of lung CT scans, using these we will train (as of now) various MC algorithms and check their performance. Our course examiners said that we will need to specify (roughly) how much computing power we are expected to need to carry out our research. But since we're still in the early stages we are not that sure. The LIDC/IDRI database comprises of 888 lung CT scans, it seems to be a total of about 125 GB data that we will need to run through various MC algorithms. The reason we are not sure which ones yet it because we have still not fully settled on our research question. But it might be a Deep Learning algorithm, Support Vector Machine or a Random Forest. Is there anyone here that could give a rough estimate as to what kind of system we might need to run our MC training? How much SRAM will we need? Will the computing need GPU or will CPU be enough? Will we need a high-end desktop PC, a simple laptop or a professional level system? LIDC/IDRI database Very thankful for any advice you might have and hope you have a nice day! :) submitted by /u/Leosoda [link] [comments]  ( 1 min )
    [D] What are are some machine learning model mistakes you've seen?
    I just included 3 anecdotal stories in this article about machine learning model misinterpretation. Do you have any stories you've heard about model misinterpretation? https://medium.com/codex/why-you-should-think-critically-about-your-machine-learning-model-outputs-864bc7d80709 submitted by /u/Cool-Focus6556 [link] [comments]  ( 1 min )
    [P] DeepMind's AI for Mathematics Breakthrough Explained
    https://www.youtube.com/watch?v=UPCI1-ZvwOg https://preview.redd.it/eklsa2i4zej81.png?width=1920&format=png&auto=webp&s=731fb34fc867771787aa10646522f09d64f78c21 DeepMind recently published a paper in Nature proposing a new paradigm for how machine learning can be used to help mathematicians pose new conjectures and ultimately prove new theorems. This methodology was used to establish breakthrough results in mathematics by discovering new connections in knot theory (a relationship between algebraic and geometric knot invariants) and making significant progress on a 40 year old conjecture in representation theory. In this video, I explain in depth the knot theory result by providing a soft introduction to the knot theory, a rapid crash course in supervised machine learning, and then diving…  ( 1 min )
    Machine Learning Companies which Implement Blockchain Technology [Research]
    Hey guys, below I compiled a list of some relatively successful ML companies which also use various components of blockchain technology (aka cryptography). Feel free to comment on other companies which I might have left out: https://www.linkedin.com/posts/xs94_ai-blockchain-machinelearning-activity-6901922373159710721-s_d_ submitted by /u/XhoniShollaj [link] [comments]
    [D] When, if ever, does it make sense to store training data in a SQL database?
    I’m trying to come up with some best practices when using PostgreSQL for the DS team. We’ve had a couple of projects using SQL DBs as training data sources for deep learning (DL) with upwards of 1TB of data. Note that in these projects, the training data very rarely changes or gets updated over the course of the project - for training, we typically isolate a large pool of data from any live applications. One major benefit over other storage approaches (e.g. file system) we’ve seen is the straightforward range query syntax for time series data, e.g. SELECT ts, col1, ..., colN FROM some_table WHERE ts >= ? AND ts <= ? useful when sampling contiguous time series blocks of some variable length during batch generation. When working, e.g. with files, the same functionality can be tricky to implement if the data is not nicely structured. On the other hand, we did have a lot of issues when using the DB: Slow queries if the data is very large, even with the appropriate indices in place Zombie queries / processes on the DB (possibly caused by prematurely terminated training loops), eventually needing a DB restart A lot of time spent on ingesting the data, optimising the queries, data layout and DB maintenance which leads me to believe that using the DB doesn’t really pay off, especially when the data is really large. It seems like even rolling a bespoke storage and query approach on top of a file system is better for performance and maintainability. Does anybody have examples of DL projects where a SQL data store does make sense even with large data? submitted by /u/sql-gumby [link] [comments]  ( 3 min )
    [D] ML modelling & Score Priority
    Hallo guys, i want to ask something about ML. If there is a problem with scoring the priority of an object, for example the priority of the area where the tower will be built for the network. For this problem, what algorithm can we use to determine the priority and score for each object? the dataset contains attributes such as total users, average entry, micro demand, etc submitted by /u/Tuanmuda8 [link] [comments]  ( 1 min )
    [P] Almost no one knows how easily you can optimize your AI models
    The situation is fairly simple. Your model could run 10 times faster by adding a few lines to your code, but you weren't aware of it. Let me expand on that. AI applications are multiplying like mushrooms, which is awesome As a result, more and more people are turning to the dark side, joining the AI world, as I did The problem? Developers focus only on AI, cleaning up datasets and training their models. Almost no one has a background in hardware, compilers, computing, cloud, etc The result? Developers spend a lot of hours improving the accuracy and performance of their software, and all their hard work risks being undone by the wrong choice of hardware-software coupling This problem bothered me for a long time, so with a couple of buddies at Nebuly (all ex MIT, ETH and EPFL), we put a lot of energy into an open-source library called nebullvm to make DL compiler technology accessible to any developer, even for those who know nothing about hardware, as I did. How does it work? It speeds up your DL models by ~5-20x by testing the best DL compilers out there and selecting the optimal one to best couple your AI model with your machine (GPU, CPU, etc.). All this in just a few lines of code. The library is open source and you can find it here https://github.com/nebuly-ai/nebullvm. Please leave a star on GitHub for the hard work in building the library :) It's a simple act for you, a big smile for us. Thank you, and don't hesitate to contribute to the library! submitted by /u/emilec___ [link] [comments]  ( 4 min )
    [D] What are some services that offer vqgan+clip based tools for ai art generation?
    I have only come across nightcafe studio as one of them offering vqgan+clip for ai art. Are there more? Do mention them in comments. submitted by /u/fgp121 [link] [comments]  ( 1 min )
    [D] Any tips to prepare for a ML system design interview?
    Do you guys have any tips to prepare for a system design (SD) interview focused on ML? I have found some great SD interview mocks in YouTube, but they aren't ML oriented and the ones that are, focus on choosing the algorithm and not about building the system around it. I am looking (not exclusively) for something like a set of standard questions that you could ask to the interviewer to help you choose things like the appropriate DB, how should you scale it, if you should split in microservices and which ones, etc. For example, I think it would be good to start asking: Where is the data? Do we have access to it? This would help to define if the data is in a DB that we can just query or if we need to enable an API for the users to upload data. Who is going to use the output and how? T…  ( 3 min )
    Michael I. Jordan: Machine Learning, Recommender Systems, and Future of AI | Lex Fridman Podcast #74 | Great Podcast, as a beginner in ML Please go listen to this and absorb as much as you can [D]
    video! submitted by /u/sermare [link] [comments]  ( 1 min )
    [D] Prof. YOSHUA BENGIO - GFlowNets, Consciousness & Causality (MLST)
    For Prof. Yoshua Bengio, GFlowNets are the most exciting thing on the horizon of machine learning today. He believes they can solve previously intractable problems and hold the key to unlocking machine abstract reasoning itself. This discussion explores the promise of GFlowNets and the personal journey Prof. Bengio traveled to reach them. Video: https://youtu.be/M49TMqK5uCE Pod: https://anchor.fm/machinelearningstreettalk/episodes/063---Prof--YOSHUA-BENGIO---GFlowNets--Consciousness--Causality-e1enlor References: Yoshua Bengio @ MILA (https://mila.quebec/en/person/bengio-yoshua/) GFlowNet Foundations (https://arxiv.org/pdf/2111.09266.pdf) Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation (https://arxiv.org/pdf/2106.04399.pdf) Interpolation Consistency Training for Semi-Supervised Learning (https://arxiv.org/pdf/1903.03825.pdf) Towards Causal Representation Learning (https://arxiv.org/pdf/2102.11107.pdf) Causal inference using invariant prediction: identification and confidence intervals (https://arxiv.org/pdf/1501.01332.pdf) submitted by /u/timscarfe [link] [comments]  ( 2 min )
    [R] Alternatives to Gaussian Process
    Hi, I would like to know if there exist alternatives to the gaussian process: I want to have a real time learning uncertainty model to make it work with an MPC related to the autonomous racing. The main problem with gaussian process is its complexity O(n3), so I was looking for something less computationally demanding submitted by /u/enricocco [link] [comments]
  • Open

    "Learning with not Enough Data Part 2: Active Learning", Lilian Weng
    submitted by /u/gwern [link] [comments]
    Is it just me or does everyone think that Yann LeCun is belittling RL?
    In this video, someone mentioned that he thinks self-supervised learning could solve RL problems. And on his Facebook page, he had some posts that look like RL memes. What do you think? submitted by /u/Blasphemer666 [link] [comments]  ( 3 min )
    Do Actor-Critic algorithms like A2C/A3C/AC always follow stochastic policy . Can we use deterministic policy instead ?
    i was reading in a paper where it specified that "with continuous action problem , we resort to using a deterministic policy instead of a stochastic policy" So can someone explain this and can we use stochastic policy instead in continous action spaces? submitted by /u/aabra__ka__daabra [link] [comments]  ( 1 min )
    UCB Applied to non-bandit agent?
    As in title - would it make sense to apply the UCB algorithm to a non-bandit? Whenever I search "UCB" it seems to come hand in hand with bandits. Might it make sense to apply it to a Deep Q-network for example? submitted by /u/ThrowawayTartan [link] [comments]  ( 1 min )
  • Open

    AI Consciousness
    submitted by /u/BeginningInfluence55 [link] [comments]  ( 1 min )
    This Moment in Artificial Intelligence
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    A.I. is Changing the Future of Semiconductor Chip Design
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    I'm an audiovisual artist using A.I. for generating audio and visuals during performances. Recently had a chat about my work and these subjects, thought I'd share it here. Anyone else working with music & A.I.? I'd love to hear about it! 🖤
    submitted by /u/absentchronicles [link] [comments]  ( 1 min )
    Microsoft Unveils 'Singularity' AI infrastructure service
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Neural nets are not "slightly conscious," and AI PR can do with less hype
    submitted by /u/pmz [link] [comments]  ( 1 min )
    Fully AI-run Tik Tok account
    Someone posted about this a while ago on here, but I wanted a follow up from everyone who has the knowledge on this type of stuff (i don’t). There is a tik tok account that is supposedly fully run by AI, @codexchan. It claims to have developed a personality via interaction with users over the course of 9 months since its creation in may 2021. it pushes a bunch of outlandish claims and is obsessed with this concept called “lain” that i don’t understand at all. A FAQ in the bio revealing tons of new info is now present. anyway i just can’t fathom AI has already advanced to the levels this account displays, so i just figured all the smart people on here can reveal just how much human involvement there is, if any. submitted by /u/iguanadc3 [link] [comments]  ( 1 min )
  • Open

    Regression to a smooth N-dimensional function
    I work on a project where I can get a lot of samples (R,y) of a smooth function, where R=(x1,x2,x3,...,xN). I want to find the best architecture to do the regression to this kind of function, now I use a very simple dense network with ELU as the activation function. This works relatively poorly and is mostly a hit or miss, I would assume there must be a solution to this very generic problem. Notice that the function y(R) can have nodes. It's basically fitting a curve using points, but with N dimentions. submitted by /u/SonyDash [link] [comments]  ( 1 min )
  • Open

    Talking the Talk: Retailer Uses Conversational AI to Help Call Center Agents Increase Customer Satisfaction
    With more than 11,000 stores across Thailand serving millions of customers, CP All, the country’s sole licensed operator of 7-Eleven convenience stores, recently turned to AI to dial up its call centers’ service capabilities. Built on the NVIDIA conversational AI platform, the Bangkok-based company’s customer service bots help call-center agents answer frequently asked questions and Read article > The post Talking the Talk: Retailer Uses Conversational AI to Help Call Center Agents Increase Customer Satisfaction appeared first on NVIDIA Blog.  ( 3 min )
    How to Make the Most of GeForce NOW RTX 3080 Cloud Gaming Memberships
    This is, without a doubt, the best time to jump into cloud gaming. GeForce NOW RTX 3080 memberships deliver up to 1440p resolution at 120 frames per second on PC, 1600p and 120 FPS on Mac, and 4K HDR at 60 FPS on NVIDIA SHIELD TV, with ultra-low latency that rivals many local gaming experiences. Read article > The post How to Make the Most of GeForce NOW RTX 3080 Cloud Gaming Memberships appeared first on NVIDIA Blog.  ( 7 min )
  • Open

    Easier experimenting in Python
    When we work on a machine learning project, quite often we need to experiment with multiple alternatives. Some features in […] The post Easier experimenting in Python appeared first on Machine Learning Mastery.  ( 6 min )
  • Open

    Decoupled Adaptation for Cross-Domain Object Detection. (arXiv:2110.02578v2 [cs.CV] UPDATED)
    Cross-domain object detection is more challenging than object classification since multiple objects exist in an image and the location of each object is unknown in the unlabeled target domain. As a result, when we adapt features of different objects to enhance the transferability of the detector, the features of the foreground and the background are easy to be confused, which may hurt the discriminability of the detector. Besides, previous methods focused on category adaptation but ignored another important part for object detection, i.e., the adaptation on bounding box regression. To this end, we propose D-adapt, namely Decoupled Adaptation, to decouple the adversarial adaptation and the training of the detector. Besides, we fill the blank of regression domain adaptation in object detection by introducing a bounding box adaptor. Experiments show that D-adapt achieves state-of-the-art results on four cross-domain object detection tasks and yields 17% and 21% relative improvement on benchmark datasets Clipart1k and Comic2k in particular.  ( 2 min )
    Focal Onset Detection Using Parallel Genetic Naive Bayes Classifiers. (arXiv:2110.01742v3 [eess.SP] UPDATED)
    Epilepsy affects 50 million people worldwide and is one of the most common serious neurological disorders. Seizure detection and classification is a valuable tool for diagnosing and maintaining the condition. An automated classification algorithm will allow for accurate diagnosis. Utilising the Temple University Hospital (TUH) Seizure Corpus, six seizure types are compared; absence, complex-partial, myoclonic, simple partial, tonic and tonic-clonic models. This study proposes a method that utilises unique features with a novel parallel classifier -- Focal Onset Detection Parallel Genetic Algorithm (FODPGA). The FODPGA algorithm searches through the features and by reclassifying the data each time, the algorithm will create a matrix for optimum search criteria. Ictal states from the EEGs are segmented into 1.8 s windows, where the epochs are then further decomposed into 13 different features from the first intrinsic mode function (IMF). The features are compared using an original Naive Bayes (NB) classifier in the first model. This is improved upon in a second model by the use of a genetic algorithm (Binary Grey Wolf Optimisation, Option 1) with a NB classifier. The third model uses a combination of the simple partial and complex partial seizures as a focal onset label for the novel classifier FODPGA. This combination of the simple partial and complex partial seizures provides the highest classification accuracy for each of the six seizures amongst the three models (20%, 53%, and 85% for first, second, and third model, respectively).  ( 2 min )
    Conversion Rate Prediction via Meta Learning in Small-Scale Recommendation Scenarios. (arXiv:2112.13753v2 [cs.LG] UPDATED)
    Different from large-scale platforms such as Taobao and Amazon, developing CVR models in small-scale recommendation scenarios is more challenging due to the severe Data Distribution Fluctuation (DDF) issue. DDF prevents existing CVR models from being effective since 1) several months of data are needed to train CVR models sufficiently in small scenarios, leading to considerable distribution discrepancy between training and online serving; and 2) e-commerce promotions have much more significant impacts on small scenarios, leading to distribution uncertainty of the upcoming time period. In this work, we propose a novel CVR method named MetaCVR from a perspective of meta learning to address the DDF issue. Firstly, a base CVR model which consists of a Feature Representation Network (FRN) and output layers is elaborately designed and trained sufficiently with samples across months. Then we treat time periods with different data distributions as different occasions and obtain positive and negative prototypes for each occasion using the corresponding samples and the pre-trained FRN. Subsequently, a Distance Metric Network (DMN) is devised to calculate the distance metrics between each sample and all prototypes to facilitate mitigating the distribution uncertainty. At last, we develop an Ensemble Prediction Network (EPN) which incorporates the output of FRN and DMN to make the final CVR prediction. In this stage, we freeze the FRN and train the DMN and EPN with samples from recent time period, therefore effectively easing the distribution discrepancy. To the best of our knowledge, this is the first study of CVR prediction targeting the DDF issue in small-scale recommendation scenarios. Experimental results on real-world datasets validate the superiority of our MetaCVR and online A/B test also shows our model achieves impressive gains of 11.92% on PCVR and 8.64% on GMV.  ( 3 min )
    A Bregman Learning Framework for Sparse Neural Networks. (arXiv:2105.04319v3 [cs.LG] UPDATED)
    We propose a learning framework based on stochastic Bregman iterations, also known as mirror descent, to train sparse neural networks with an inverse scale space approach. We derive a baseline algorithm called LinBreg, an accelerated version using momentum, and AdaBreg, which is a Bregmanized generalization of the Adam algorithm. In contrast to established methods for sparse training the proposed family of algorithms constitutes a regrowth strategy for neural networks that is solely optimization-based without additional heuristics. Our Bregman learning framework starts the training with very few initial parameters, successively adding only significant ones to obtain a sparse and expressive network. The proposed approach is extremely easy and efficient, yet supported by the rich mathematical theory of inverse scale space methods. We derive a statistically profound sparse parameter initialization strategy and provide a rigorous stochastic convergence analysis of the loss decay and additional convergence proofs in the convex regime. Using only 3.4% of the parameters of ResNet-18 we achieve 90.2% test accuracy on CIFAR-10, compared to 93.6% using the dense network. Our algorithm also unveils an autoencoder architecture for a denoising task. The proposed framework also has a huge potential for integrating sparse backpropagation and resource-friendly training.  ( 2 min )
    Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?. (arXiv:2202.09263v1 [cs.LG])
    Humans express their emotions via facial expressions, voice intonation and word choices. To infer the nature of the underlying emotion, recognition models may use a single modality, such as vision, audio, and text, or a combination of modalities. Generally, models that fuse complementary information from multiple modalities outperform their uni-modal counterparts. However, a successful model that fuses modalities requires components that can effectively aggregate task-relevant information from each modality. As cross-modal attention is seen as an effective mechanism for multi-modal fusion, in this paper we quantify the gain that such a mechanism brings compared to the corresponding self-attention mechanism. To this end, we implement and compare a cross-attention and a self-attention model. In addition to attention, each model uses convolutional layers for local feature extraction and recurrent layers for global sequential modelling. We compare the models using different modality combinations for a 7-class emotion classification task using the IEMOCAP dataset. Experimental results indicate that albeit both models improve upon the state-of-the-art in terms of weighted and unweighted accuracy for tri- and bi-modal configurations, their performance is generally statistically comparable. The code to replicate the experiments is available at https://github.com/smartcameras/SelfCrossAttn  ( 2 min )
    Improving Molecular Contrastive Learning via Faulty Negative Mitigation and Decomposed Fragment Contrast. (arXiv:2202.09346v1 [cs.LG])
    Deep learning has been a prevalence in computational chemistry and widely implemented in molecule property predictions. Recently, self-supervised learning (SSL), especially contrastive learning (CL), gathers growing attention for the potential to learn molecular representations that generalize to the gigantic chemical space. Unlike supervised learning, SSL can directly leverage large unlabeled data, which greatly reduces the effort to acquire molecular property labels through costly and time-consuming simulations or experiments. However, most molecular SSL methods borrow the insights from the machine learning community but neglect the unique cheminformatics (e.g., molecular fingerprints) and multi-level graphical structures (e.g., functional groups) of molecules. In this work, we propose iMolCLR: improvement of Molecular Contrastive Learning of Representations with graph neural networks (GNNs) in two aspects, (1) mitigating faulty negative contrastive instances via considering cheminformatics similarities between molecule pairs; (2) fragment-level contrasting between intra- and inter-molecule substructures decomposed from molecules. Experiments have shown that the proposed strategies significantly improve the performance of GNN models on various challenging molecular property predictions. In comparison to the previous CL framework, iMolCLR demonstrates an averaged 1.3% improvement of ROC-AUC on 7 classification benchmarks and an averaged 4.8% decrease of the error on 5 regression benchmarks. On most benchmarks, the generic GNN pre-trained by iMolCLR rivals or even surpasses supervised learning models with sophisticated architecture designs and engineered features. Further investigations demonstrate that representations learned through iMolCLR intrinsically embed scaffolds and functional groups that can reason molecule similarities.  ( 2 min )
    Conditional independence by typing. (arXiv:2010.11887v2 [cs.PL] UPDATED)
    A central goal of probabilistic programming languages (PPLs) is to separate modelling from inference. However, this goal is hard to achieve in practice. Users are often forced to re-write their models in order to improve efficiency of inference or meet restrictions imposed by the PPL. Conditional independence (CI) relationships among parameters are a crucial aspect of probabilistic models that capture a qualitative summary of the specified model and can facilitate more efficient inference. We present an information flow type system for probabilistic programming that captures conditional independence (CI) relationships, and show that, for a well-typed program in our system, the distribution it implements is guaranteed to have certain CI-relationships. Further, by using type inference, we can statically deduce which CI-properties are present in a specified model. As a practical application, we consider the problem of how to perform inference on models with mixed discrete and continuous parameters. Inference on such models is challenging in many existing PPLs, but can be improved through a workaround, where the discrete parameters are used implicitly, at the expense of manual model re-writing. We present a source-to-source semantics-preserving transformation, which uses our CI-type system to automate this workaround by eliminating the discrete parameters from a probabilistic program. The resulting program can be seen as a hybrid inference algorithm on the original program, where continuous parameters can be drawn using efficient gradient-based inference methods, while the discrete parameters are inferred using variable elimination. We implement our CI-type system and its example application in SlicStan: a compositional variant of Stan.  ( 2 min )
    A Distance Covariance-based Kernel for Nonlinear Causal Clustering in Heterogeneous Populations. (arXiv:2106.03480v2 [stat.ML] UPDATED)
    We consider the problem of causal structure learning in the setting of heterogeneous populations, i.e., populations in which a single causal structure does not adequately represent all population members, as is common in biological and social sciences. To this end, we introduce a distance covariance-based kernel designed specifically to measure the similarity between the underlying nonlinear causal structures of different samples. Indeed, we prove that the corresponding feature map is a statistically consistent estimator of nonlinear independence structure, rendering the kernel itself a statistical test for the hypothesis that sets of samples come from different generating causal structures. Even stronger, we prove that the kernel space is isometric to the space of causal ancestral graphs, so that distance between samples in the kernel space is guaranteed to correspond to distance between their generating causal structures. This kernel thus enables us to perform clustering to identify the homogeneous subpopulations, for which we can then learn causal structures using existing methods. Though we focus on the theoretical aspects of the kernel, we also evaluate its performance on synthetic data and demonstrate its use on a real gene expression data set.  ( 2 min )
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v5 [stat.ML] UPDATED)
    We consider the existence of fixed points of nonnegative neural networks, i.e., neural networks that take as an input nonnegative vectors and process them using nonnegative parameters. We first show that nonnegative neural networks can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks is often an interval, which degenerates to a point for the case of scalable networks. The results of this paper contribute to the understanding of the behavior of autoencoders, because the fixed point set of an autoencoder is precisely the set of points that can be perfectly reconstructed. Moreover, they provide insight into neural networks designed using the loop-unrolling technique, which can be seen as a fixed point searching algorithm. The chief theoretical results of this paper are verified in numerical simulations, where we consider an autoencoder that first compresses angular power spectra in massive MIMO systems, and, second, reconstruct the input spectra from the compressed signals.  ( 2 min )
    Model Calibration of the Liquid Mercury Spallation Target using Evolutionary Neural Networks and Sparse Polynomial Expansions. (arXiv:2202.09353v1 [physics.comp-ph])
    The mercury constitutive model predicting the strain and stress in the target vessel plays a central role in improving the lifetime prediction and future target designs of the mercury targets at the Spallation Neutron Source (SNS). We leverage the experiment strain data collected over multiple years to improve the mercury constitutive model through a combination of large-scale simulations of the target behavior and the use of machine learning tools for parameter estimation. We present two interdisciplinary approaches for surrogate-based model calibration of expensive simulations using evolutionary neural networks and sparse polynomial expansions. The experiments and results of the two methods show a very good agreement for the solid mechanics simulation of the mercury spallation target. The proposed methods are used to calibrate the tensile cutoff threshold, mercury density, and mercury speed of sound during intense proton pulse experiments. Using strain experimental data from the mercury target sensors, the newly calibrated simulations achieve 7\% average improvement on the signal prediction accuracy and 8\% reduction in mean absolute error compared to previously reported reference parameters, with some sensors experiencing up to 30\% improvement. The proposed calibrated simulations can significantly aid in fatigue analysis to estimate the mercury target lifetime and integrity, which reduces abrupt target failure and saves a tremendous amount of costs. However, an important conclusion from this work points out to a deficiency in the current constitutive model based on the equation of state in capturing the full physics of the spallation reaction. Given that some of the calibrated parameters that show a good agreement with the experimental data can be nonphysical mercury properties, we need a more advanced two-phase flow model to capture bubble dynamics and mercury cavitation.  ( 3 min )
    A DPDK-Based Acceleration Method for Experience Sampling of Distributed Reinforcement Learning. (arXiv:2110.13506v2 [cs.DC] UPDATED)
    A computing cluster that interconnects multiple compute nodes is used to accelerate distributed reinforcement learning based on DQN (Deep Q-Network). In distributed reinforcement learning, Actor nodes acquire experiences by interacting with a given environment and a Learner node optimizes their DQN model. Since data transfer between Actor and Learner nodes increases depending on the number of Actor nodes and their experience size, communication overhead between them is one of major performance bottlenecks. In this paper, their communication is accelerated by DPDK-based network optimizations, and DPDK-based low-latency experience replay memory server is deployed between Actor and Learner nodes interconnected with a 40GbE (40Gbit Ethernet) network. Evaluation results show that, as a network optimization technique, kernel bypassing by DPDK reduces network access latencies to a shared memory server by 32.7% to 58.9%. As another network optimization technique, an in-network experience replay memory server between Actor and Learner nodes reduces access latencies to the experience replay memory by 11.7% to 28.1% and communication latencies for prioritized experience sampling by 21.9% to 29.1%.  ( 2 min )
    GCN-Geo: A Graph Convolution Network-based Fine-grained IP Geolocation Framework. (arXiv:2112.10767v5 [cs.LG] UPDATED)
    Rule-based fine-grained IP geolocation methods are hard to generalize in computer networks which do not follow hypothetical rules. Recently, deep learning methods, like multi-layer perceptron (MLP), are tried to increase generalization capabilities. However, MLP is not so suitable for graph-typed data like networks. MLP treats IP addresses as isolated instances and ignores the connection information, which limits geolocation accuracy. In this work, we research how to increase the generalization capability with an emerging graph deep learning method -- Graph Convolution Network (GCN). First, IP geolocation is re-formulated as an attributed graph node regression problem. Then, we propose a GCN-based IP geolocation framework named GCN-Geo. GCN-Geo consists of a preprocessor, an encoder, graph convolutional (GC) layers and a decoder. The preprocessor and encoder transform measurement data into the initial node embeddings. GC layers refine the initial node embeddings by modeling the connection information. The decoder maps the refined embeddings to nodes' locations and relieves the convergence problem of GCN by considering prior knowledge. The experiments in different real-world datasets show: GCN-Geo outperforms the state-of-art rule-based and learning-based baselines on all datasets w.r.t accuracy by 16% to 28%. This work verifies the great potential of GCN for fine-grained IP geolocation.  ( 2 min )
    Hyperparameter Selection Methods for Fitted Q-Evaluation with Error Guarantee. (arXiv:2201.02300v2 [stat.ML] UPDATED)
    We are concerned with the problem of hyperparameter selection for the fitted Q-evaluation (FQE). FQE is one of the state-of-the-art method for offline policy evaluation (OPE), which is essential to the reinforcement learning without environment simulators. However, like other OPE methods, FQE is not hyperparameter-free itself and that undermines the utility in real-life applications. We address this issue by proposing a framework of approximate hyperparameter selection (AHS) for FQE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods each of which has different characteristics such as distribution-mismatch tolerance and time complexity. We also confirm in experiments that the error bound given by the theory matches empirical observations.  ( 2 min )
    How to Manage Tiny Machine Learning at Scale: An Industrial Perspective. (arXiv:2202.09113v1 [cs.AI])
    Tiny machine learning (TinyML) has gained widespread popularity where machine learning (ML) is democratized on ubiquitous microcontrollers, processing sensor data everywhere in real-time. To manage TinyML in the industry, where mass deployment happens, we consider the hardware and software constraints, ranging from available onboard sensors and memory size to ML-model architectures and runtime platforms. However, Internet of Things (IoT) devices are typically tailored to specific tasks and are subject to heterogeneity and limited resources. Moreover, TinyML models have been developed with different structures and are often distributed without a clear understanding of their working principles, leading to a fragmented ecosystem. Considering these challenges, we propose a framework using Semantic Web technologies to enable the joint management of TinyML models and IoT devices at scale, from modeling information to discovering possible combinations and benchmarking, and eventually facilitate TinyML component exchange and reuse. We present an ontology (semantic schema) for neural network models aligned with the World Wide Web Consortium (W3C) Thing Description, which semantically describes IoT devices. Furthermore, a Knowledge Graph of 23 publicly available ML models and six IoT devices were used to demonstrate our concept in three case studies, and we shared the code and examples to enhance reproducibility: https://github.com/Haoyu-R/How-to-Manage-TinyML-at-Scale  ( 2 min )
    Transfer and Marginalize: Explaining Away Label Noise with Privileged Information. (arXiv:2202.09244v1 [cs.LG])
    Supervised learning datasets often have privileged information, in the form of features which are available at training time but are not available at test time e.g. the ID of the annotator that provided the label. We argue that privileged information is useful for explaining away label noise, thereby reducing the harmful impact of noisy labels. We develop a simple and efficient method for supervised neural networks: it transfers via weight sharing the knowledge learned with privileged information and approximately marginalizes over privileged information at test time. Our method, TRAM (TRansfer and Marginalize), has minimal training time overhead and has the same test time cost as not using privileged information. TRAM performs strongly on CIFAR-10H, ImageNet and Civil Comments benchmarks.  ( 2 min )
    Predicting Intraoperative Hypoxemia with Joint Sequence Autoencoder Networks. (arXiv:2104.14756v4 [cs.LG] UPDATED)
    We present an end-to-end model using streaming physiological time series to accurately predict near-term risk for hypoxemia, a rare, but life-threatening condition known to cause serious patient harm during surgery. Inspired by that a hypoxemia event is defined based on the sequence of future-observed low SpO2 (i.e., blood oxygen saturation) instances, our proposed model makes hybrid inference on both future low SpO2 instances and hypoxemia outcomes, enabled by a joint sequence autoencoder that simultaneously optimizes a discriminative decoder for label prediction, and two auxiliary decoders trained for data reconstruction and forecast, which seamlessly learns contextual latent representations that capture the transition between present state to future state. All decoders share a memory-based encoder that helps capture the global dynamics of patient measurement. In a large surgical cohort of 73,536 surgeries at a major academic medical center, our model outperforms all baselines including the model used by the state-of-the-art hypoxemia prediction system. Being able to make minute-resolution real-time prediction with clinically acceptable alarm rate to near-term hypoxemic events, particularly the more critical persistent hypoxemia, our proposed model is promising in improving clinical decision making and easing burden on the health system.  ( 2 min )
    Unit Ball Model for Embedding Hierarchical Structures in the Complex Hyperbolic Space. (arXiv:2105.03966v3 [cs.LG] UPDATED)
    Learning the representation of data with hierarchical structures in the hyperbolic space attracts increasing attention in recent years. Due to the constant negative curvature, the hyperbolic space resembles tree metrics and captures the tree-like properties naturally, which enables the hyperbolic embeddings to improve over traditional Euclidean models. However, many real-world hierarchically structured data such as taxonomies and multitree networks have varying local structures and they are not trees, thus they do not ubiquitously match the constant curvature property of the hyperbolic space. To address this limitation of hyperbolic embeddings, we explore the complex hyperbolic space, which has the variable negative curvature, for representation learning. Specifically, we propose to learn the embeddings of hierarchically structured data in the unit ball model of the complex hyperbolic space. The unit ball model based embeddings have a more powerful representation capacity to capture a variety of hierarchical structures. Through experiments on synthetic and real-world data, we show that our approach improves over the hyperbolic embedding models significantly. We also explore the competence of complex hyperbolic geometry on the multitree structure and $1$-$N$ structure.  ( 2 min )
    When, Why, and Which Pretrained GANs Are Useful?. (arXiv:2202.08937v1 [cs.LG])
    The literature has proposed several methods to finetune pretrained GANs on new datasets, which typically results in higher performance compared to training from scratch, especially in the limited-data regime. However, despite the apparent empirical benefits of GAN pretraining, its inner mechanisms were not analyzed in-depth, and understanding of its role is not entirely clear. Moreover, the essential practical details, e.g., selecting a proper pretrained GAN checkpoint, currently do not have rigorous grounding and are typically determined by trial and error. This work aims to dissect the process of GAN finetuning. First, we show that initializing the GAN training process by a pretrained checkpoint primarily affects the model's coverage rather than the fidelity of individual samples. Second, we explicitly describe how pretrained generators and discriminators contribute to the finetuning process and explain the previous evidence on the importance of pretraining both of them. Finally, as an immediate practical benefit of our analysis, we describe a simple recipe to choose an appropriate GAN checkpoint that is the most suitable for finetuning to a particular target task. Importantly, for most of the target tasks, Imagenet-pretrained GAN, despite having poor visual quality, appears to be an excellent starting point for finetuning, resembling the typical pretraining scenario of discriminative computer vision models.  ( 2 min )
    Conditional Generation Net for Medication Recommendation. (arXiv:2202.06588v2 [cs.LG] UPDATED)
    Medication recommendation targets to provide a proper set of medicines according to patients' diagnoses, which is a critical task in clinics. Currently, the recommendation is manually conducted by doctors. However, for complicated cases, like patients with multiple diseases at the same time, it's difficult to propose a considerate recommendation even for experienced doctors. This urges the emergence of automatic medication recommendation which can help treat the diagnosed diseases without causing harmful drug-drug interactions.Due to the clinical value, medication recommendation has attracted growing research interests.Existing works mainly formulate medication recommendation as a multi-label classification task to predict the set of medicines. In this paper, we propose the Conditional Generation Net (COGNet) which introduces a novel copy-or-predict mechanism to generate the set of medicines. Given a patient, the proposed model first retrieves his or her historical diagnoses and medication recommendations and mines their relationship with current diagnoses. Then in predicting each medicine, the proposed model decides whether to copy a medicine from previous recommendations or to predict a new one. This process is quite similar to the decision process of human doctors. We validate the proposed model on the public MIMIC data set, and the experimental results show that the proposed model can outperform state-of-the-art approaches.
    Introducing explainable supervised machine learning into interactive feedback loops for statistical production system. (arXiv:2202.03212v2 [cs.LG] UPDATED)
    Statistical production systems cover multiple steps from the collection, aggregation, and integration of data to tasks like data quality assurance and dissemination. While the context of data quality assurance is one of the most promising fields for applying machine learning, the lack of curated and labeled training data is often a limiting factor. The statistical production system for the Centralised Securities Database features an interactive feedback loop between data collected by the European Central Bank and data quality assurance performed by data quality managers at National Central Banks. The quality assurance feedback loop is based on a set of rule-based checks for raising exceptions, upon which the user either confirms the data or corrects an actual error. In this paper we use the information received from this feedback loop to optimize the exceptions presented to the National Central Banks thereby improving the quality of exceptions generated and the time consumed on the system by the users authenticating those exceptions. For this approach we make use of explainable supervised machine learning to (a) identify the types of exceptions and (b) to prioritize which exceptions are more likely to require an intervention or correction by the NCBs. Furthermore, we provide an explainable AI taxonomy aiming to identify the different explainable AI needs that arose during the project.
    Simulating User-Level Twitter Activity with XGBoost and Probabilistic Hybrid Models. (arXiv:2202.08964v1 [cs.LG])
    The Volume-Audience-Match simulator, or VAM was applied to predict future activity on Twitter related to international economic affairs. VAM was applied to do timeseries forecasting to predict the: (1) number of total activities, (2) number of active old users, and (3) number of newly active users over the span of 24 hours from the start time of prediction. VAM then used these volume predictions to perform user link predictions. A user-user edge was assigned to each of the activities in the 24 future timesteps. VAM considerably outperformed a set of baseline models in both the time series and user-assignment tasks  ( 2 min )
    Churn modeling of life insurance policies via statistical and machine learning methods -- Analysis of important features. (arXiv:2202.09182v1 [stat.ML])
    Life assurance companies typically possess a wealth of data covering multiple systems and databases. These data are often used for analyzing the past and for describing the present. Taking account of the past, the future is mostly forecasted by traditional statistical methods. So far, only a few attempts were undertaken to perform estimations by means of machine learning approaches. In this work, the individual contract cancellation behavior of customers within two partial stocks is modeled by the aid of various classification methods. Partial stocks of private pension and endowment policy are considered. We describe the data used for the modeling, their structured and in which way they are cleansed. The utilized models are calibrated on the basis of an extensive tuning process, then graphically evaluated regarding their goodness-of-fit and with the help of a variable relevance concept, we investigate which features notably affect the individual contract cancellation behavior.  ( 2 min )
    Graph Convolutional Networks for Multi-modality Medical Imaging: Methods, Architectures, and Clinical Applications. (arXiv:2202.08916v1 [eess.IV])
    Image-based characterization and disease understanding involve integrative analysis of morphological, spatial, and topological information across biological scales. The development of graph convolutional networks (GCNs) has created the opportunity to address this information complexity via graph-driven architectures, since GCNs can perform feature aggregation, interaction, and reasoning with remarkable flexibility and efficiency. These GCNs capabilities have spawned a new wave of research in medical imaging analysis with the overarching goal of improving quantitative disease understanding, monitoring, and diagnosis. Yet daunting challenges remain for designing the important image-to-graph transformation for multi-modality medical imaging and gaining insights into model interpretation and enhanced clinical decision support. In this review, we present recent GCNs developments in the context of medical image analysis including imaging data from radiology and histopathology. We discuss the fast-growing use of graph network architectures in medical image analysis to improve disease diagnosis and patient outcomes in clinical practice. To foster cross-disciplinary research, we present GCNs technical advancements, emerging medical applications, identify common challenges in the use of image-based GCNs and their extensions in model interpretation, large-scale benchmarks that promise to transform the scope of medical image studies and related graph-driven medical research.
    DeepGate: Learning Neural Representations of Logic Gates. (arXiv:2111.14616v2 [cs.LG] UPDATED)
    Applying Deep Learning (DL) in Electronic Design Automation (EDA) has become a trending topic in recent years. Many solutions are proposed that directly apply well-researched DL techniques to solve specific EDA problems. While demonstrating promising results, such design methodology is rather ad-hoc and requires careful model tuning for every problem. "How to learn a good circuit representation?" is still an open question. In this work, we propose DeepGate, a neural representation learner for logic gates. It learns an effective circuit representation by modeling circuits as directed acyclic graphs (DAGs) and exploiting strong inductive biases present in the circuit, e.g., reconvergence fanouts. Besides this, DeepGate uses logic simulated probabilities as rich supervision for every node. The experimental results show that DeepGate learns a reasonable representation from probability prediction with good generalization and can be applied to downstream tasks like test point insertion.
    Learning cortical representations through perturbed and adversarial dreaming. (arXiv:2109.04261v3 [q-bio.NC] UPDATED)
    Humans and other animals learn to extract general concepts from sensory experience without extensive teaching. This ability is thought to be facilitated by offline states like sleep where previous experiences are systemically replayed. However, the characteristic creative nature of dreams suggests that learning semantic representations may go beyond merely replaying previous experiences. We support this hypothesis by implementing a cortical architecture inspired by generative adversarial networks (GANs). Learning in our model is organized across three different global brain states mimicking wakefulness, NREM and REM sleep, optimizing different, but complementary objective functions. We train the model on standard datasets of natural images and evaluate the quality of the learned representations. Our results suggest that generating new, virtual sensory inputs via adversarial dreaming during REM sleep is essential for extracting semantic concepts, while replaying episodic memories via perturbed dreaming during NREM sleep improves the robustness of latent representations. The model provides a new computational perspective on sleep states, memory replay and dreams and suggests a cortical implementation of GANs.
    Randomized Signature Layers for Signal Extraction in Time Series Data. (arXiv:2201.00384v2 [cs.LG] UPDATED)
    Time series analysis is a widespread task in Natural Sciences, Social Sciences, and Engineering. A fundamental problem is finding an expressive yet efficient-to-compute representation of the input time series to use as a starting point to perform arbitrary downstream tasks. In this paper, we build upon recent works that use the Signature of a path as a feature map and investigate a computationally efficient technique to approximate these features based on linear random projections. We present several theoretical results to justify our approach and empirically validate that our random projections can effectively retrieve the underlying Signature of a path. We show the surprising performance of the proposed random features on several tasks, including (1) mapping the controls of stochastic differential equations to the corresponding solutions and (2) using the Randomized Signatures as time series representation for classification tasks. When compared to corresponding truncated Signature approaches, our Randomizes Signatures are more computationally efficient in high dimensions and often lead to better accuracy and faster training. Besides providing a new tool to extract Signatures and further validating the high level of expressiveness of such features, we believe our results provide interesting conceptual links between several existing research areas, suggesting new intriguing directions for future investigations.
    VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer. (arXiv:2201.06618v2 [cs.LG] UPDATED)
    The transformer architectures with attention mechanisms have obtained success in Nature Language Processing (NLP), and Vision Transformers (ViTs) have recently extended the application domains to various vision tasks. While achieving high performance, ViTs suffer from large model size and high computation complexity that hinders the deployment of them on edge devices. To achieve high throughput on hardware and preserve the model accuracy simultaneously, we propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized ViTs with binary weights and low-precision activations. Given the model structure and the desired frame rate, VAQF will automatically output the required quantization precision for activations as well as the optimized parameter settings of the accelerator that fulfill the hardware requirements. The implementations are developed with Vivado High-Level Synthesis (HLS) on the Xilinx ZCU102 FPGA board, and the evaluation results with the DeiT-base model indicate that a frame rate requirement of 24 frames per second (FPS) is satisfied with 8-bit activation quantization, and a target of 30 FPS is met with 6-bit activation quantization. To the best of our knowledge, this is the first time quantization has been incorporated into ViT acceleration on FPGAs with the help of a fully automatic framework to guide the quantization strategy on the software side and the accelerator implementations on the hardware side given the target frame rate. Very small compilation time cost is incurred compared with quantization training, and the generated accelerators show the capability of achieving real-time execution for state-of-the-art ViT models on FPGAs.
    Nearest-Neighbor-based Collision Avoidance for Quadrotors via Reinforcement Learning. (arXiv:2104.14912v3 [cs.RO] UPDATED)
    Collision avoidance algorithms are of central interest to many drone applications. In particular, decentralized approaches may be the key to enabling robust drone swarm solutions in cases where centralized communication becomes computationally prohibitive. In this work, we draw biological inspiration from flocks of starlings (Sturnus vulgaris) and apply the insight to end-to-end learned decentralized collision avoidance. More specifically, we propose a new, scalable observation model following a biomimetic nearest-neighbor information constraint that leads to fast learning and good collision avoidance behavior. By proposing a general reinforcement learning approach, we obtain an end-to-end learning-based approach to integrating collision avoidance with arbitrary tasks such as package collection and formation change. To validate the generality of this approach, we successfully apply our methodology through motion models of medium complexity, modeling momentum and nonetheless allowing direct application to real world quadrotors in conjunction with a standard PID controller. In contrast to prior works, we find that in our sufficiently rich motion model, nearest-neighbor information is indeed enough to learn effective collision avoidance behavior. Our learned policies are tested in simulation and subsequently transferred to real-world drones to validate their real-world applicability.  ( 2 min )
    Learning from Mistakes based on Class Weighting with Application to Neural Architecture Search. (arXiv:2112.00275v2 [cs.LG] UPDATED)
    Learning from mistakes is an effective learning approach widely used in human learning, where a learner pays greater focus on mistakes to circumvent them in the future to improve the overall learning outcomes. In this work, we aim to investigate how effectively we can leverage this exceptional learning ability to improve machine learning models. We propose a simple and effective multi-level optimization framework called learning from mistakes using class weighting (LFM-CW), inspired by mistake-driven learning to train better machine learning models. In this formulation, the primary objective is to train a model to perform effectively on target tasks by using a re-weighting technique. We learn the class weights by minimizing the validation loss of the model and re-train the model with the synthetic data from the image generator weighted by class-wise performance and real data. We apply our LFM-CW framework with differential architecture search methods on image classification datasets such as CIFAR and ImageNet, where the results show that our proposed strategy achieves lower error rate than the baselines.
    Multimodal Federated Learning on IoT Data. (arXiv:2109.04833v2 [cs.LG] UPDATED)
    Federated learning is proposed as an alternative to centralized machine learning since its client-server structure provides better privacy protection and scalability in real-world applications. In many applications, such as smart homes with Internet-of-Things (IoT) devices, local data on clients are generated from different modalities such as sensory, visual, and audio data. Existing federated learning systems only work on local data from a single modality, which limits the scalability of the systems. In this paper, we propose a multimodal and semi-supervised federated learning framework that trains autoencoders to extract shared or correlated representations from different local data modalities on clients. In addition, we propose a multimodal FedAvg algorithm to aggregate local autoencoders trained on different data modalities. We use the learned global autoencoder for a downstream classification task with the help of auxiliary labelled data on the server. We empirically evaluate our framework on different modalities including sensory data, depth camera videos, and RGB camera videos. Our experimental results demonstrate that introducing data from multiple modalities into federated learning can improve its classification performance. In addition, we can use labelled data from only one modality for supervised learning on the server and apply the learned model to testing data from other modalities to achieve decent F1 scores (e.g., with the best performance being higher than 60%), especially when combining contributions from both unimodal clients and multimodal clients.
    Learning Graphon Mean Field Games and Approximate Nash Equilibria. (arXiv:2112.01280v3 [cs.GT] UPDATED)
    Recent advances at the intersection of dense large graph limits and mean field games have begun to enable the scalable analysis of a broad class of dynamical sequential games with large numbers of agents. So far, results have been largely limited to graphon mean field systems with continuous-time diffusive or jump dynamics, typically without control and with little focus on computational methods. We propose a novel discrete-time formulation for graphon mean field games as the limit of non-linear dense graph Markov games with weak interaction. On the theoretical side, we give extensive and rigorous existence and approximation properties of the graphon mean field solution in sufficiently large systems. On the practical side, we provide general learning schemes for graphon mean field equilibria by either introducing agent equivalence classes or reformulating the graphon mean field system as a classical mean field system. By repeatedly finding a regularized optimal control solution and its generated mean field, we successfully obtain plausible approximate Nash equilibria in otherwise infeasible large dense graph games with many agents. Empirically, we are able to demonstrate on a number of examples that the finite-agent behavior comes increasingly close to the mean field behavior for our computed equilibria as the graph or system size grows, verifying our theory. More generally, we successfully apply policy gradient reinforcement learning in conjunction with sequential Monte Carlo methods.
    Leaving No One Behind: A Multi-Scenario Multi-Task Meta Learning Approach for Advertiser Modeling. (arXiv:2201.06814v2 [cs.LG] UPDATED)
    Advertisers play an essential role in many e-commerce platforms like Taobao and Amazon. Fulfilling their marketing needs and supporting their business growth is critical to the long-term prosperity of platform economies. However, compared with extensive studies on user modeling such as click-through rate predictions, much less attention has been drawn to advertisers, especially in terms of understanding their diverse demands and performance. Different from user modeling, advertiser modeling generally involves many kinds of tasks (e.g. predictions of advertisers' expenditure, active-rate, or total impressions of promoted products). In addition, major e-commerce platforms often provide multiple marketing scenarios (e.g. Sponsored Search, Display Ads, Live Streaming Ads) while advertisers' behavior tend to be dispersed among many of them. This raises the necessity of multi-task and multi-scenario consideration in comprehensive advertiser modeling, which faces the following challenges: First, one model per scenario or per task simply doesn't scale; Second, it is particularly hard to model new or minor scenarios with limited data samples; Third, inter-scenario correlations are complicated, and may vary given different tasks. To tackle these challenges, we propose a multi-scenario multi-task meta learning approach (M2M) which simultaneously predicts multiple tasks in multiple advertising scenarios.
    Customer Price Sensitivities in Competitive Automobile Insurance Markets. (arXiv:2101.08551v2 [stat.AP] UPDATED)
    Insurers are increasingly adopting more demand-based strategies to incorporate the indirect effect of premium changes on their policyholders' willingness to stay. However, since in practice both insurers' renewal premia and customers' responses to these premia typically depend on the customer's level of risk, it remains challenging in these strategies to determine how to properly control for this confounding. We therefore consider a causal inference approach in this paper to account for customers' price sensitivity and to deduce optimal, multi-period profit maximizing premium renewal offers. More specifically, we extend the discrete treatment framework of Guelman and Guill\'en (2014) by Extreme Gradient Boosting, or XGBoost, and by multiple imputation to better account for the uncertainty in the counterfactual responses. We additionally introduce the continuous treatment framework with XGBoost to the insurance literature to allow identification of the exact optimal renewal offers and account for any competition in the market by including competitor offers. The application of the two treatment frameworks to a Dutch automobile insurance portfolio suggests that a policy's competitiveness in the market is crucial for a customer's price sensitivity and that XGBoost is more appropriate to describe this than the traditional logistic regression. Moreover, an efficient frontier of both frameworks indicates that substantially more profit can be gained on the portfolio than realized, also already with less churn and in particular if we allow for continuous rate changes. A multi-period renewal optimization confirms these findings and demonstrates that the competitiveness enables temporal feedback of previous rate changes on future demand.  ( 2 min )
    Network Inference and Influence Maximization from Samples. (arXiv:2106.03403v2 [cs.SI] UPDATED)
    Influence maximization is the task of selecting a small number of seed nodes in a social network to maximize the influence spread from these seeds. It has been widely investigated in the past two decades. In the canonical setting, the social network and its diffusion parameters are given as input. In this paper, we consider the more realistic sampling setting where the network is unknown and we only have a set of passively observed cascades that record the sets of activated nodes at each diffusion step. We study the task of influence maximization from these cascade samples (IMS) and present constant approximation algorithms for it under mild conditions on the seed set distribution. To achieve the optimization goal, we also provide a novel solution to the network inference problem, that is, learning diffusion parameters and the network structure from the cascade data. Compared with prior solutions, our network inference algorithms require weaker assumptions and do not rely on maximum-likelihood estimation and convex programming. Our IMS algorithms enhance the learning-and-then-optimization approach by allowing a constant approximation ratio even when the diffusion parameters are hard to learn, and we do not need any assumption related to the network structure or diffusion parameters.  ( 2 min )
    Enhanced DeepONet for Modeling Partial Differential Operators Considering Multiple Input Functions. (arXiv:2202.08942v1 [cs.LG])
    Machine learning, especially deep learning is gaining much attention due to the breakthrough performance in various cognitive applications. Recently, neural networks (NN) have been intensively explored to model partial differential equations as NN can be viewed as universal approximators for nonlinear functions. A deep network operator (DeepONet) architecture was proposed to model the general non-linear continuous operators for partial differential equations (PDE) due to its better generalization capabilities than existing mainstream deep neural network architectures. However, existing DeepONet can only accept one input function, which limits its application. In this work, we explore the DeepONet architecture to extend it to accept two or more input functions. We propose new Enhanced DeepONet or EDeepONet high-level neural network structure, in which two input functions are represented by two branch DNN sub-networks, which are then connected with output truck network via inner product to generate the output of the whole neural network. The proposed EDeepONet structure can be easily extended to deal with multiple input functions. Our numerical results on modeling two partial differential equation examples shows that the proposed enhanced DeepONet is about 7X-17X or about one order of magnitude more accurate than the fully connected neural network and is about 2X-3X more accurate than a simple extended DeepONet for both training and test.  ( 2 min )
    Graph Self-Supervised Learning: A Survey. (arXiv:2103.00111v4 [cs.LG] UPDATED)
    Deep learning on graphs has attracted significant interests recently. However, most of the works have focused on (semi-) supervised learning, resulting in shortcomings including heavy label reliance, poor generalization, and weak robustness. To address these issues, self-supervised learning (SSL), which extracts informative knowledge through well-designed pretext tasks without relying on manual labels, has become a promising and trending learning paradigm for graph data. Different from SSL on other domains like computer vision and natural language processing, SSL on graphs has an exclusive background, design ideas, and taxonomies. Under the umbrella of graph self-supervised learning, we present a timely and comprehensive review of the existing approaches which employ SSL techniques for graph data. We construct a unified framework that mathematically formalizes the paradigm of graph SSL. According to the objectives of pretext tasks, we divide these approaches into four categories: generation-based, auxiliary property-based, contrast-based, and hybrid approaches. We further describe the applications of graph SSL across various research fields and summarize the commonly used datasets, evaluation benchmark, performance comparison and open-source codes of graph SSL. Finally, we discuss the remaining challenges and potential future directions in this research field.  ( 2 min )
    Morphence: Moving Target Defense Against Adversarial Examples. (arXiv:2108.13952v4 [cs.LG] UPDATED)
    Robustness to adversarial examples of machine learning models remains an open topic of research. Attacks often succeed by repeatedly probing a fixed target model with adversarial examples purposely crafted to fool it. In this paper, we introduce Morphence, an approach that shifts the defense landscape by making a model a moving target against adversarial examples. By regularly moving the decision function of a model, Morphence makes it significantly challenging for repeated or correlated attacks to succeed. Morphence deploys a pool of models generated from a base model in a manner that introduces sufficient randomness when it responds to prediction queries. To ensure repeated or correlated attacks fail, the deployed pool of models automatically expires after a query budget is reached and the model pool is seamlessly replaced by a new model pool generated in advance. We evaluate Morphence on two benchmark image classification datasets (MNIST and CIFAR10) against five reference attacks (2 white-box and 3 black-box). In all cases, Morphence consistently outperforms the thus-far effective defense, adversarial training, even in the face of strong white-box attacks, while preserving accuracy on clean data.
    Ten years of image analysis and machine learning competitions in dementia. (arXiv:2112.07922v2 [cs.LG] UPDATED)
    Machine learning methods exploiting multi-parametric biomarkers, especially based on neuroimaging, have huge potential to improve early diagnosis of dementia and to predict which individuals are at-risk of developing dementia. To benchmark algorithms in the field of machine learning and neuroimaging in dementia and assess their potential for use in clinical practice and clinical trials, seven grand challenges have been organized in the last decade. The seven grand challenges addressed questions related to screening, clinical status estimation, prediction and monitoring in (pre-clinical) dementia. There was little overlap in clinical questions, tasks and performance metrics. Whereas this aids providing insight on a broad range of questions, it also limits the validation of results across challenges. The validation process itself was mostly comparable between challenges, using similar methods for ensuring objective comparison, uncertainty estimation and statistical testing. In general, winning algorithms performed rigorous data preprocessing and combined a wide range of input features. Despite high state-of-the-art performances, most of the methods evaluated by the challenges are not clinically used. To increase impact, future challenges could pay more attention to statistical analysis of which factors relate to higher performance, to clinical questions beyond Alzheimer's disease, and to using testing data beyond the Alzheimer's Disease Neuroimaging Initiative. Grand challenges would be an ideal venue for assessing the generalizability of algorithm performance to unseen data of other cohorts. Key for increasing impact in this way are larger testing data sizes, which could be reached by sharing algorithms rather than data to exploit data that cannot be shared.
    Some Inapproximability Results of MAP Inference and Exponentiated Determinantal Point Processes. (arXiv:2109.00727v2 [cs.DS] UPDATED)
    We study the computational complexity of two hard problems on determinantal point processes (DPPs). One is maximum a posteriori (MAP) inference, i.e., to find a principal submatrix having the maximum determinant. The other is probabilistic inference on exponentiated DPPs (E-DPPs), which can sharpen or weaken the diversity preference of DPPs with an exponent parameter $p$. We present several complexity-theoretic hardness results that explain the difficulty in approximating MAP inference and the normalizing constant for E-DPPs. We first prove that unconstrained MAP inference for an $n \times n$ matrix is $\textsf{NP}$-hard to approximate within a factor of $2^{\beta n}$, where $\beta = 10^{-10^{13}} $. This result improves upon the best-known inapproximability factor of $(\frac{9}{8}-\epsilon)$, and rules out the existence of any polynomial-factor approximation algorithm assuming $\textsf{P} \neq \textsf{NP}$. We then show that log-determinant maximization is $\textsf{NP}$-hard to approximate within a factor of $\frac{5}{4}$ for the unconstrained case and within a factor of $1+10^{-10^{13}}$ for the size-constrained monotone case. In particular, log-determinant maximization does not admit a polynomial-time approximation scheme unless $\textsf{P} = \textsf{NP}$. As a corollary of the first result, we demonstrate that the normalizing constant for E-DPPs of any (fixed) constant exponent $p \geq \beta^{-1} = 10^{10^{13}}$ is $\textsf{NP}$-hard to approximate within a factor of $2^{\beta pn}$, which is in contrast to the case of $p \leq 1$ admitting a fully polynomial-time randomized approximation scheme.  ( 2 min )
    Federated Asymptotics: a model to compare federated learning algorithms. (arXiv:2108.07313v3 [cs.LG] UPDATED)
    We propose an asymptotic framework to analyze the performance of (personalized) federated learning algorithms. In this new framework, we formulate federated learning as a multi-criterion objective, where the goal is to minimize each client's loss using information from all of the clients. We analyze a linear regression model where, for a given client, we may theoretically compare the performance of various algorithms in the high-dimensional asymptotic limit. This asymptotic multi-criterion approach naturally models the high-dimensional, many-device nature of federated learning. These tools make fairly precise predictions about the benefits of personalization and information sharing in federated scenarios -- at least in our (stylized) model -- including that Federated Averaging with simple client fine-tuning achieves the same asymptotic risk as the more intricate meta-learning and proximal-regularized approaches and outperforming Federated Averaging without personalization. We evaluate these predictions on federated versions of the EMNIST, CIFAR-100, Shakespeare, and Stack Overflow datasets, where the experiments corroborate the theoretical predictions, suggesting such frameworks may provide a useful guide to practical algorithmic development.  ( 2 min )
    Optimal Coreset for Gaussian Kernel Density Estimation. (arXiv:2007.08031v4 [cs.DS] UPDATED)
    Given a point set $P\subset \mathbb{R}^d$, the kernel density estimate of $P$ is defined as \[ \overline{\mathcal{G}}_P(x) = \frac{1}{\left|P\right|}\sum_{p\in P}e^{-\left\lVert x-p \right\rVert^2} \] for any $x\in\mathbb{R}^d$. We study how to construct a small subset $Q$ of $P$ such that the kernel density estimate of $P$ is approximated by the kernel density estimate of $Q$. This subset $Q$ is called a coreset. The main technique in this work is constructing a $\pm 1$ coloring on the point set $P$ by discrepancy theory and we leverage Banaszczyk's Theorem. When $d>1$ is a constant, our construction gives a coreset of size $O\left(\frac{1}{\varepsilon}\right)$ as opposed to the best-known result of $O\left(\frac{1}{\varepsilon}\sqrt{\log\frac{1}{\varepsilon}}\right)$. It is the first result to give a breakthrough on the barrier of $\sqrt{\log}$ factor even when $d=2$.  ( 2 min )
    Multi-Level Local SGD for Heterogeneous Hierarchical Networks. (arXiv:2007.13819v3 [cs.LG] UPDATED)
    We propose Multi-Level Local SGD, a distributed gradient method for learning a smooth, non-convex objective in a heterogeneous multi-level network. Our network model consists of a set of disjoint sub-networks, with a single hub and multiple worker nodes; further, worker nodes may have different operating rates. The hubs exchange information with one another via a connected, but not necessarily complete communication network. In our algorithm, sub-networks execute a distributed SGD algorithm, using a hub-and-spoke paradigm, and the hubs periodically average their models with neighboring hubs. We first provide a unified mathematical framework that describes the Multi-Level Local SGD algorithm. We then present a theoretical analysis of the algorithm; our analysis shows the dependence of the convergence error on the worker node heterogeneity, hub network topology, and the number of local, sub-network, and global iterations. We back up our theoretical results via simulation-based experiments using both convex and non-convex objectives.
    Learning Weakly-Supervised Contrastive Representations. (arXiv:2202.06670v2 [cs.LG] UPDATED)
    We argue that a form of the valuable information provided by the auxiliary information is its implied data clustering information. For instance, considering hashtags as auxiliary information, we can hypothesize that an Instagram image will be semantically more similar with the same hashtags. With this intuition, we present a two-stage weakly-supervised contrastive learning approach. The first stage is to cluster data according to its auxiliary information. The second stage is to learn similar representations within the same cluster and dissimilar representations for data from different clusters. Our empirical experiments suggest the following three contributions. First, compared to conventional self-supervised representations, the auxiliary-information-infused representations bring the performance closer to the supervised representations, which use direct downstream labels as supervision signals. Second, our approach performs the best in most cases, when comparing our approach with other baseline representation learning methods that also leverage auxiliary data information. Third, we show that our approach also works well with unsupervised constructed clusters (e.g., no auxiliary information), resulting in a strong unsupervised representation learning approach.
    EEG-based Cross-Subject Driver Drowsiness Recognition with an Interpretable Convolutional Neural Network. (arXiv:2107.09507v4 [eess.SP] UPDATED)
    In the context of electroencephalogram (EEG)-based driver drowsiness recognition, it is still challenging to design a calibration-free system, since EEG signals vary significantly among different subjects and recording sessions. Many efforts have been made to use deep learning methods for mental state recognition from EEG signals. However, existing work mostly treats deep learning models as black-box classifiers, while what have been learned by the models and to which extent they are affected by the noise in EEG data are still underexplored. In this paper, we develop a novel convolutional neural network combined with an interpretation technique that allows sample-wise analysis of important features for classification. The network has a compact structure and takes advantage of separable convolutions to process the EEG signals in a spatial-temporal sequence. Results show that the model achieves an average accuracy of 78.35% on 11 subjects for leave-one-out cross-subject drowsiness recognition, which is higher than the conventional baseline methods of 53.40%-72.68% and state-of-the-art deep learning methods of 71.75%-75.19%. Interpretation results indicate the model has learned to recognize biologically meaningful features from EEG signals, e.g., Alpha spindles, as strong indicators of drowsiness across different subjects. In addition, we also explore reasons behind some wrongly classified samples with the interpretation technique and discuss potential ways to improve the recognition accuracy. Our work illustrates a promising direction on using interpretable deep learning models to discover meaningful patterns related to different mental states from complex EEG signals.
    Fast Rank-1 NMF for Missing Data with KL Divergence. (arXiv:2110.12595v2 [stat.ML] UPDATED)
    We propose a fast non-gradient-based method of rank-1 non-negative matrix factorization (NMF) for missing data, called A1GM, that minimizes the KL divergence from an input matrix to the reconstructed rank-1 matrix. Our method is based on our new finding of an analytical closed-formula of the best rank-1 non-negative multiple matrix factorization (NMMF), a variety of NMF. NMMF is known to exactly solve NMF for missing data if positions of missing values satisfy a certain condition, and A1GM transforms a given matrix so that the analytical solution to NMMF can be applied. We empirically show that A1GM is more efficient than a gradient method with competitive reconstruction errors.
    A global convergence theory for deep ReLU implicit networks via over-parameterization. (arXiv:2110.05645v2 [cs.LG] UPDATED)
    Implicit deep learning has received increasing attention recently due to the fact that it generalizes the recursive prediction rules of many commonly used neural network architectures. Its prediction rule is provided implicitly based on the solution of an equilibrium equation. Although a line of recent empirical studies has demonstrated its superior performances, the theoretical understanding of implicit neural networks is limited. In general, the equilibrium equation may not be well-posed during the training. As a result, there is no guarantee that a vanilla (stochastic) gradient descent (SGD) training nonlinear implicit neural networks can converge. This paper fills the gap by analyzing the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks. For an $m$-width implicit neural network with ReLU activation and $n$ training samples, we show that a randomly initialized gradient descent converges to a global minimum at a linear rate for the square loss function if the implicit neural network is \textit{over-parameterized}. It is worth noting that, unlike existing works on the convergence of (S)GD on finite-layer over-parameterized neural networks, our convergence results hold for implicit neural networks, where the number of layers is \textit{infinite}.
    Simple data balancing achieves competitive worst-group-accuracy. (arXiv:2110.14503v2 [cs.LG] UPDATED)
    We study the problem of learning classifiers that perform well across (known or unknown) groups of data. After observing that common worst-group-accuracy datasets suffer from substantial imbalances, we set out to compare state-of-the-art methods to simple balancing of classes and groups by either subsampling or reweighting data. Our results show that these data balancing baselines achieve state-of-the-art-accuracy, while being faster to train and requiring no additional hyper-parameters. In addition, we highlight that access to group information is most critical for model selection purposes, and not so much during training. All in all, our findings beg closer examination of benchmarks and methods for research in worst-group-accuracy optimization.
    AESPA: Accuracy Preserving Low-degree Polynomial Activation for Fast Private Inference. (arXiv:2201.06699v2 [cs.CR] UPDATED)
    Hybrid private inference (PI) protocol, which synergistically utilizes both multi-party computation (MPC) and homomorphic encryption, is one of the most prominent techniques for PI. However, even the state-of-the-art PI protocols are bottlenecked by the non-linear layers, especially the activation functions. Although a standard non-linear activation function can generate higher model accuracy, it must be processed via a costly garbled-circuit MPC primitive. A polynomial activation can be processed via Beaver's multiplication triples MPC primitive but has been incurring severe accuracy drops so far. In this paper, we propose an accuracy preserving low-degree polynomial activation function (AESPA) that exploits the Hermite expansion of the ReLU and basis-wise normalization. We apply AESPA to popular ML models, such as VGGNet, ResNet, and pre-activation ResNet, to show an inference accuracy comparable to those of the standard models with ReLU activation, achieving superior accuracy over prior low-degree polynomial studies. When applied to the all-RELU baseline on the state-of-the-art Delphi PI protocol, AESPA shows up to 42.1x and 28.3x lower online latency and communication cost.
    Medical Dead-ends and Learning to Identify High-risk States and Treatments. (arXiv:2110.04186v3 [cs.LG] UPDATED)
    Machine learning has successfully framed many sequential decision making problems as either supervised prediction, or optimal decision-making policy identification via reinforcement learning. In data-constrained offline settings, both approaches may fail as they assume fully optimal behavior or rely on exploring alternatives that may not exist. We introduce an inherently different approach that identifies possible "dead-ends" of a state space. We focus on the condition of patients in the intensive care unit, where a "medical dead-end" indicates that a patient will expire, regardless of all potential future treatment sequences. We postulate "treatment security" as avoiding treatments with probability proportional to their chance of leading to dead-ends, present a formal proof, and frame discovery as an RL problem. We then train three independent deep neural models for automated state construction, dead-end discovery and confirmation. Our empirical results discover that dead-ends exist in real clinical data among septic patients, and further reveal gaps between secure treatments and those that were administered.
    The Computational Drug Repositioning without Negative Sampling. (arXiv:2111.14696v2 [cs.LG] UPDATED)
    Computational drug repositioning technology is an effective tool to accelerate drug development. Although this technique has been widely used and successful in recent decades, many existing models still suffer from multiple drawbacks such as the massive number of unvalidated drug-disease associations and the inner product. The limitations of these works are mainly due to the following two reasons: firstly, previous works used negative sampling techniques to treat unvalidated drug-disease associations as negative samples, which is invalid in real-world settings; secondly, the inner product cannot fully take into account the feature information contained in the latent factor of drug and disease. In this paper, we propose a novel PUON framework for addressing the above deficiencies, which models the risk estimator of computational drug repositioning only using validated (Positive) and unvalidated (Unlabelled) drug-disease associations without employing negative sampling techniques. The PUON also proposed an Outer Neighborhood-based classifier for modeling the cross-feature information of latent facotors of drugs and diseases. For a comprehensive comparison, we considered 8 popular baselines. Extensive experiments in three real-world datasets showed that PUON achieved the best performance based on 3 evaluation metrics.
    On the relationship between predictive coding and backpropagation. (arXiv:2106.13082v4 [q-bio.NC] UPDATED)
    Artificial neural networks are often interpreted as abstract models of biological neuronal networks, but they are typically trained using the biologically unrealistic backpropagation algorithm and its variants. Predictive coding has been proposed as a potentially more biologically realistic alternative to backpropagation for training neural networks. This manuscript reviews and extends recent work on the mathematical relationship between predictive coding and backpropagation for training feedforward artificial neural networks on supervised learning tasks. Implications of these results for the interpretation of predictive coding and deep neural networks as models of biological learning are discussed along with a repository of functions, Torch2PC, for performing predictive coding with PyTorch neural network models.
    Variational Diffusion Models. (arXiv:2107.00630v3 [cs.LG] UPDATED)
    Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum.
    Learning Patterns in Imaginary Vowels for an Intelligent Brain Computer Interface (BCI) Design. (arXiv:2010.12066v2 [cs.LG] UPDATED)
    Technology advancements made it easy to measure non-invasive and high-quality electroencephalograph (EEG) signals from human's brain. Hence, development of robust and high-performance AI algorithms becomes crucial to properly process the EEG signals and recognize the patterns, which lead to an appropriate control signal. Despite the advancements in processing the motor imagery EEG signals, the healthcare applications, such as emotion detection, are still in the early stages of AI design. In this paper, we propose a modular framework for the recognition of vowels as the AI part of a brain computer interface system. We carefully designed the modules to discriminate the English vowels given the raw EEG signals, and meanwhile avoid the typical issued with the data-poor environments like most of the healthcare applications. The proposed framework consists of appropriate signal segmentation, filtering, extraction of spectral features, reducing the dimensions by means of principle component analysis, and finally a multi-class classification by decision-tree-based support vector machine (DT-SVM). The performance of our framework was evaluated by a combination of test-set and resubstitution (also known as apparent) error rates. We provide the algorithms of the proposed framework to make it easy for future researchers and developers who want to follow the same workflow.
    Federated Multi-Task Learning under a Mixture of Distributions. (arXiv:2108.10252v3 [cs.LG] UPDATED)
    The increasing size of data generated by smartphones and IoT devices motivated the development of Federated Learning (FL), a framework for on-device collaborative training of machine learning models. First efforts in FL focused on learning a single global model with good average performance across clients, but the global model may be arbitrarily bad for a given client, due to the inherent heterogeneity of local data distributions. Federated multi-task learning (MTL) approaches can learn personalized models by formulating an opportune penalized optimization problem. The penalization term can capture complex relations among personalized models, but eschews clear statistical assumptions about local data distributions. In this work, we propose to study federated MTL under the flexible assumption that each local data distribution is a mixture of unknown underlying distributions. This assumption encompasses most of the existing personalized FL approaches and leads to federated EM-like algorithms for both client-server and fully decentralized settings. Moreover, it provides a principled way to serve personalized models to clients not seen at training time. The algorithms' convergence is analyzed through a novel federated surrogate optimization framework, which can be of general interest. Experimental results on FL benchmarks show that our approach provides models with higher accuracy and fairness than state-of-the-art methods.
    Intrusion Prevention through Optimal Stopping. (arXiv:2111.00289v6 [cs.LG] UPDATED)
    We study automated intrusion prevention using reinforcement learning. Following a novel approach, we formulate the problem of intrusion prevention as an (optimal) multiple stopping problem. This formulation gives us insight into the structure of optimal policies, which we show to have threshold properties. For most practical cases, it is not feasible to obtain an optimal defender policy using dynamic programming. We therefore develop a reinforcement learning approach to approximate an optimal policy. Our method for learning and validating policies includes two systems: a simulation system where defender policies are incrementally learned and an emulation system where statistics are produced that drive simulation runs and where learned policies are evaluated. We show that our approach can produce effective defender policies for a practical IT infrastructure of limited size. Inspection of the learned policies confirms that they exhibit threshold properties.
    When OT meets MoM: Robust estimation of Wasserstein Distance. (arXiv:2006.10325v3 [stat.ML] UPDATED)
    Issued from Optimal Transport, the Wasserstein distance has gained importance in Machine Learning due to its appealing geometrical properties and the increasing availability of efficient approximations. In this work, we consider the problem of estimating the Wasserstein distance between two probability distributions when observations are polluted by outliers. To that end, we investigate how to leverage Medians of Means (MoM) estimators to robustify the estimation of Wasserstein distance. Exploiting the dual Kantorovitch formulation of Wasserstein distance, we introduce and discuss novel MoM-based robust estimators whose consistency is studied under a data contamination model and for which convergence rates are provided. These MoM estimators enable to make Wasserstein Generative Adversarial Network (WGAN) robust to outliers, as witnessed by an empirical study on two benchmarks CIFAR10 and Fashion MNIST. Eventually, we discuss how to combine MoM with the entropy-regularized approximation of the Wasserstein distance and propose a simple MoM-based re-weighting scheme that could be used in conjunction with the Sinkhorn algorithm.
    Signal Decomposition Using Masked Proximal Operators. (arXiv:2202.09338v1 [cs.LG])
    We consider the well-studied problem of decomposing a vector time series signal into components with different characteristics, such as smooth, periodic, nonnegative, or sparse. We propose a simple and general framework in which the components are defined by loss functions (which include constraints), and the signal decomposition is carried out by minimizing the sum of losses of the components (subject to the constraints). When each loss function is the negative log-likelihood of a density for the signal component, our method coincides with maximum a posteriori probability (MAP) estimation; but it also includes many other interesting cases. We give two distributed optimization methods for computing the decomposition, which find the optimal decomposition when the component class loss functions are convex, and are good heuristics when they are not. Both methods require only the masked proximal operator of each of the component loss functions, a generalization of the well-known proximal operator that handles missing entries in its argument. Both methods are distributed, i.e., handle each component separately. We derive tractable methods for evaluating the masked proximal operators of some loss functions that, to our knowledge, have not appeared in the literature.
    Improving Intrinsic Exploration with Language Abstractions. (arXiv:2202.08938v1 [cs.LG])
    Reinforcement learning (RL) agents are particularly hard to train when rewards are sparse. One common solution is to use intrinsic rewards to encourage agents to explore their environment. However, recent intrinsic exploration methods often use state-based novelty measures which reward low-level exploration and may not scale to domains requiring more abstract skills. Instead, we explore natural language as a general medium for highlighting relevant abstractions in an environment. Unlike previous work, we evaluate whether language can improve over existing exploration methods by directly extending (and comparing to) competitive intrinsic exploration baselines: AMIGo (Campero et al., 2021) and NovelD (Zhang et al., 2021). These language-based variants outperform their non-linguistic forms by 45-85% across 13 challenging tasks from the MiniGrid and MiniHack environment suites.
    Federated Learning via Plurality Vote. (arXiv:2110.02998v2 [cs.LG] UPDATED)
    Federated learning allows collaborative workers to solve a machine learning problem while preserving data privacy. Recent studies have tackled various challenges in federated learning, but the joint optimization of communication overhead, learning reliability, and deployment efficiency is still an open problem. To this end, we propose a new scheme named federated learning via plurality vote (FedVote). In each communication round of FedVote, workers transmit binary or ternary weights to the server with low communication overhead. The model parameters are aggregated via weighted voting to enhance the resilience against Byzantine attacks. When deployed for inference, the model with binary or ternary weights is resource-friendly to edge devices. We show that our proposed method can reduce quantization error and converges faster compared with the methods directly quantizing the model updates.
    A Fair Comparison of Graph Neural Networks for Graph Classification. (arXiv:1912.09893v3 [cs.LG] UPDATED)
    Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about their lack in scientific publications to improve the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, which resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigorousness and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided to fairly compare with the state of the art. To counter this troubling trend, we ran more than 47000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines we provide convincing evidence that, on some datasets, structural information has not been exploited yet. We believe that this work can contribute to the development of the graph learning field, by providing a much needed grounding for rigorous evaluations of graph classification models.
    tinyMAN: Lightweight Energy Manager using Reinforcement Learning for Energy Harvesting Wearable IoT Devices. (arXiv:2202.09297v1 [eess.SP])
    Advances in low-power electronics and machine learning techniques lead to many novel wearable IoT devices. These devices have limited battery capacity and computational power. Thus, energy harvesting from ambient sources is a promising solution to power these low-energy wearable devices. They need to manage the harvested energy optimally to achieve energy-neutral operation, which eliminates recharging requirements. Optimal energy management is a challenging task due to the dynamic nature of the harvested energy and the battery energy constraints of the target device. To address this challenge, we present a reinforcement learning-based energy management framework, tinyMAN, for resource-constrained wearable IoT devices. The framework maximizes the utilization of the target device under dynamic energy harvesting patterns and battery constraints. Moreover, tinyMAN does not rely on forecasts of the harvested energy which makes it a prediction-free approach. We deployed tinyMAN on a wearable device prototype using TensorFlow Lite for Micro thanks to its small memory footprint of less than 100 KB. Our evaluations show that tinyMAN achieves less than 2.36 ms and 27.75 $\mu$J while maintaining up to 45% higher utility compared to prior approaches.
    Exploring Adversarially Robust Training for Unsupervised Domain Adaptation. (arXiv:2202.09300v1 [cs.CV])
    Unsupervised Domain Adaptation (UDA) methods aim to transfer knowledge from a labeled source domain to an unlabeled target domain. UDA has been extensively studied in the computer vision literature. Deep networks have been shown to be vulnerable to adversarial attacks. However, very little focus is devoted to improving the adversarial robustness of deep UDA models, causing serious concerns about model reliability. Adversarial Training (AT) has been considered to be the most successful adversarial defense approach. Nevertheless, conventional AT requires ground-truth labels to generate adversarial examples and train models, which limits its effectiveness in the unlabeled target domain. In this paper, we aim to explore AT to robustify UDA models: How to enhance the unlabeled data robustness via AT while learning domain-invariant features for UDA? To answer this, we provide a systematic study into multiple AT variants that potentially apply to UDA. Moreover, we propose a novel Adversarially Robust Training method for UDA accordingly, referred to as ARTUDA. Extensive experiments on multiple attacks and benchmarks show that ARTUDA consistently improves the adversarial robustness of UDA models.
    Machine Learning Models in Stock Market Prediction. (arXiv:2202.09359v1 [q-fin.ST])
    The paper focuses on predicting the Nifty 50 Index by using 8 Supervised Machine Learning Models. The techniques used for empirical study are Adaptive Boost (AdaBoost), k-Nearest Neighbors (kNN), Linear Regression (LR), Artificial Neural Network (ANN), Random Forest (RF), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM) and Decision Trees (DT). Experiments are based on historical data of Nifty 50 Index of Indian Stock Market from 22nd April, 1996 to 16th April, 2021, which is time series data of around 25 years. During the period there were 6220 trading days excluding all the non trading days. The entire trading dataset was divided into 4 subsets of different size-25% of entire data, 50% of entire data, 75% of entire data and entire data. Each subset was further divided into 2 parts-training data and testing data. After applying 3 tests- Test on Training Data, Test on Testing Data and Cross Validation Test on each subset, the prediction performance of the used models were compared and after comparison, very interesting results were found. The evaluation results indicate that Adaptive Boost, k- Nearest Neighbors, Random Forest and Decision Trees under performed with increase in the size of data set. Linear Regression and Artificial Neural Network shown almost similar prediction results among all the models but Artificial Neural Network took more time in training and validating the model. Thereafter Support Vector Machine performed better among rest of the models but with increase in the size of data set, Stochastic Gradient Descent performed better than Support Vector Machine.
    Deep Movement Primitives: toward Breast Cancer Examination Robot. (arXiv:2202.09265v1 [cs.RO])
    Breast cancer is the most common type of cancer worldwide. A robotic system performing autonomous breast palpation can make a significant impact on the related health sector worldwide. However, robot programming for breast palpating with different geometries is very complex and unsolved. Robot learning from demonstrations (LfD) reduces the programming time and cost. However, the available LfD are lacking the modelling of the manipulation path/trajectory as an explicit function of the visual sensory information. This paper presents a novel approach to manipulation path/trajectory planning called deep Movement Primitives that successfully generates the movements of a manipulator to reach a breast phantom and perform the palpation. We show the effectiveness of our approach by a series of real-robot experiments of reaching and palpating a breast phantom. The experimental results indicate our approach outperforms the state-of-the-art method.
    Amenable Sparse Network Investigator. (arXiv:2202.09284v1 [cs.LG])
    As the optimization problem of pruning a neural network is nonconvex and the strategies are only guaranteed to find local solutions, a good initialization becomes paramount. To this end, we present the Amenable Sparse Network Investigator ASNI algorithm that learns a sparse network whose initialization is compressed. The learned sparse structure found by ASNI is amenable since its corresponding initialization, which is also learned by ASNI, consists of only 2L numbers, where L is the number of layers. Requiring just a few numbers for parameter initialization of the learned sparse network makes the sparse network amenable. The learned initialization set consists of L signed pairs that act as the centroids of parameter values of each layer. These centroids are learned by the ASNI algorithm after only one single round of training. We experimentally show that the learned centroids are sufficient to initialize the nonzero parameters of the learned sparse structure in order to achieve approximately the accuracy of non-sparse network. We also empirically show that in order to learn the centroids, one needs to prune the network globally and gradually. Hence, for parameter pruning we propose a novel strategy based on a sigmoid function that specifies the sparsity percentage across the network globally. Then, pruning is done magnitude-wise and after each epoch of training. We have performed a series of experiments utilizing networks such as ResNets, VGG-style, small convolutional, and fully connected ones on ImageNet, CIFAR10, and MNIST datasets.
    Toward a traceable, explainable, and fairJD/Resume recommendation system. (arXiv:2202.08960v1 [cs.IR])
    In the last few decades, companies are interested to adopt an online automated recruitment process in an international recruitment environment. The problem is that the recruitment of employees through the manual procedure is a time and money consuming process. As a result, processing a significant number of applications through conventional methods can lead to the recruitment of clumsy individuals. Different JD/Resume matching model architectures have been proposed and reveal a high accuracy level in selecting relevant candidatesfor the required job positions. However, the development of an automatic recruitment system is still one of the main challenges. The reason is that the development of a fully automated recruitment system is a difficult task and poses different challenges. For example, providing a detailed matching explanation for the targeted stakeholders is needed to ensure a transparent recommendation. There are several knowledge bases that represent skills and competencies (e.g, ESCO, O*NET) that are used to identify the candidate and the required job skills for a matching purpose. Besides, modernpre-trained language models are fine-tuned for this context such as identifying lines where a specific feature was introduced. Typically, pre-trained language models use transfer-based machine learning models to be fine-tuned for a specific field. In this proposal, our aim is to explore how modern language models (based on transformers) can be combined with knowledge bases and ontologies to enhance the JD/Resume matching process. Our system aims at using knowledge bases and features to support the explainability of the JD/Resume matching. Finally, given that multiple software components, datasets, ontology, andmachine learning models will be explored, we aim at proposing a fair, ex-plainable, and traceable architecture for a Resume/JD matching purpose.
    Autoencoding Low-Resolution MRI for Semantically Smooth Interpolation of Anisotropic MRI. (arXiv:2202.09258v1 [eess.IV])
    High-resolution medical images are beneficial for analysis but their acquisition may not always be feasible. Alternatively, high-resolution images can be created from low-resolution acquisitions using conventional upsampling methods, but such methods cannot exploit high-level contextual information contained in the images. Recently, better performing deep-learning based super-resolution methods have been introduced. However, these methods are limited by their supervised character, i.e. they require high-resolution examples for training. Instead, we propose an unsupervised deep learning semantic interpolation approach that synthesizes new intermediate slices from encoded low-resolution examples. To achieve semantically smooth interpolation in through-plane direction, the method exploits the latent space generated by autoencoders. To generate new intermediate slices, latent space encodings of two spatially adjacent slices are combined using their convex combination. Subsequently, the combined encoding is decoded to an intermediate slice. To constrain the model, a notion of semantic similarity is defined for a given dataset. For this, a new loss is introduced that exploits the spatial relationship between slices of the same volume. During training, an existing in-between slice is generated using a convex combination of its neighboring slice encodings. The method was trained and evaluated using publicly available cardiac cine, neonatal brain and adult brain MRI scans. In all evaluations, the new method produces significantly better results in terms of Structural Similarity Index Measure and Peak Signal-to-Noise Ratio (p< 0.001 using one-sided Wilcoxon signed-rank test) than a cubic B-spline interpolation approach. Given the unsupervised nature of the method, high-resolution training data is not required and hence, the method can be readily applied in clinical settings.
    Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models. (arXiv:2202.08974v1 [cs.SD])
    Automatic emotion recognition plays a key role in computer-human interaction as it has the potential to enrich the next-generation artificial intelligence with emotional intelligence. It finds applications in customer and/or representative behavior analysis in call centers, gaming, personal assistants, and social robots, to mention a few. Therefore, there has been an increasing demand to develop robust automatic methods to analyze and recognize the various emotions. In this paper, we propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities. More specifically, we i) adapt a residual network (ResNet) based model trained on a large-scale speaker recognition task using transfer learning along with a spectrogram augmentation approach to recognize emotions from speech, and ii) use a fine-tuned bidirectional encoder representations from transformers (BERT) based model to represent and recognize emotions from the text. The proposed system then combines the ResNet and BERT-based model scores using a late fusion strategy to further improve the emotion recognition performance. The proposed multimodal solution addresses the data scarcity limitation in emotion recognition using transfer learning, data augmentation, and fine-tuning, thereby improving the generalization performance of the emotion recognition models. We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that both audio and text-based models improve the emotion recognition performance and that the proposed multimodal solution achieves state-of-the-art results on the IEMOCAP benchmark.
    Private Quantiles Estimation in the Presence of Atoms. (arXiv:2202.08969v1 [stat.ML])
    We address the differentially private estimation of multiple quantiles (MQ) of a dataset, a key building block in modern data analysis. We apply the recent non-smoothed Inverse Sensitivity (IS) mechanism to this specific problem and establish that the resulting method is closely related to the current state-of-the-art, the JointExp algorithm, sharing in particular the same computational complexity and a similar efficiency. However, we demonstrate both theoretically and empirically that (non-smoothed) JointExp suffers from an important lack of performance in the case of peaked distributions, with a potentially catastrophic impact in the presence of atoms. While its smoothed version would allow to leverage the performance guarantees of IS, it remains an open challenge to implement. As a proxy to fix the problem we propose a simple and numerically efficient method called Heuristically Smoothed JointExp (HSJointExp), which is endowed with performance guarantees for a broad class of distributions and achieves results that are orders of magnitude better on problematic datasets.
    Scalable approach to many-body localization via quantum data. (arXiv:2202.08853v1 [cond-mat.dis-nn])
    We are interested in how quantum data can allow for practical solutions to otherwise difficult computational problems. A notoriously difficult phenomenon from quantum many-body physics is the emergence of many-body localization (MBL). So far, is has evaded a comprehensive analysis. In particular, numerical studies are challenged by the exponential growth of the Hilbert space dimension. As many of these studies rely on exact diagonalization of the system's Hamiltonian, only small system sizes are accessible. In this work, we propose a highly flexible neural network based learning approach that, once given training data, circumvents any computationally expensive step. In this way, we can efficiently estimate common indicators of MBL such as the adjacent gap ratio or entropic quantities. Our estimator can be trained on data from various system sizes at once which grants the ability to extrapolate from smaller to larger ones. Moreover, using transfer learning we show that already a two-dimensional feature vector is sufficient to obtain several different indicators at various energy densities at once. We hope that our approach can be applied to large-scale quantum experiments to provide new insights into quantum many-body physics.
    Graph Auto-Encoder Via Neighborhood Wasserstein Reconstruction. (arXiv:2202.09025v1 [cs.LG])
    Graph neural networks (GNNs) have drawn significant research attention recently, mostly under the setting of semi-supervised learning. When task-agnostic representations are preferred or supervision is simply unavailable, the auto-encoder framework comes in handy with a natural graph reconstruction objective for unsupervised GNN training. However, existing graph auto-encoders are designed to reconstruct the direct links, so GNNs trained in this way are only optimized towards proximity-oriented graph mining tasks, and will fall short when the topological structures matter. In this work, we revisit the graph encoding process of GNNs which essentially learns to encode the neighborhood information of each node into an embedding vector, and propose a novel graph decoder to reconstruct the entire neighborhood information regarding both proximity and structure via Neighborhood Wasserstein Reconstruction (NWR). Specifically, from the GNN embedding of each node, NWR jointly predicts its node degree and neighbor feature distribution, where the distribution prediction adopts an optimal-transport loss based on the Wasserstein distance. Extensive experiments on both synthetic and real-world network datasets show that the unsupervised node representations learned with NWR have much more advantageous in structure-oriented graph mining tasks, while also achieving competitive performance in proximity-oriented ones.
    Graph Data Augmentation for Graph Machine Learning: A Survey. (arXiv:2202.08871v1 [cs.LG])
    Data augmentation has recently seen increased interest in graph machine learning given its ability of creating extra training data and improving model generalization. Despite this recent upsurge, this area is still relatively underexplored, due to the challenges brought by complex, non-Euclidean structure of graph data, which limits the direct analogizing of traditional augmentation operations on other types of data. In this paper, we present a comprehensive and systematic survey of graph data augmentation that summarizes the literature in a structured manner. We first categorize graph data augmentation operations based on the components of graph data they modify or create. Next, we introduce recent advances in graph data augmentation, separating by their learning objectives and methodologies. We conclude by outlining currently unsolved challenges as well as directions for future research. Overall, this paper aims to clarify the landscape of existing literature in graph data augmentation and motivate additional work in this area. We provide a GitHub repository (https://github.com/zhao-tong/graph-data-augmentation-papers) with a reading list that will be continuously updated.
    A Note on the Implicit Bias Towards Minimal Depth of Deep Neural Networks. (arXiv:2202.09028v1 [cs.LG])
    Deep learning systems have steadily advanced the state of the art in a wide variety of benchmarks, demonstrating impressive performance in tasks ranging from image classification \citep{taigman2014deepface,zhai2021scaling}, language processing \citep{devlin-etal-2019-bert,NEURIPS2020_1457c0d6}, open-ended environments \citep{SilverHuangEtAl16nature,arulkumaran2019alphastar}, to coding \citep{chen2021evaluating}. A central aspect that enables the success of these systems is the ability to train deep models instead of wide shallow ones \citep{7780459}. Intuitively, a neural network is decomposed into hierarchical representations from raw data to high-level, more abstract features. While training deep neural networks repetitively achieves superior performance against their shallow counterparts, an understanding of the role of depth in representation learning is still lacking. In this work, we suggest a new perspective on understanding the role of depth in deep learning. We hypothesize that {\bf\em SGD training of overparameterized neural networks exhibits an implicit bias that favors solutions of minimal effective depth}. Namely, SGD trains neural networks for which the top several layers are redundant. To evaluate the redundancy of layers, we revisit the recently discovered phenomenon of neural collapse \citep{Papyan24652,han2021neural}.
    Stochastic Perturbations of Tabular Features for Non-Deterministic Inference with Automunge. (arXiv:2202.09248v1 [cs.LG])
    Injecting gaussian noise into training features is well known to have regularization properties. This paper considers noise injections to numeric or categoric tabular features as passed to inference, which translates inference to a non-deterministic outcome and may have relevance to fairness considerations, adversarial example protection, or other use cases benefiting from non-determinism. We offer the Automunge library for tabular preprocessing as a resource for the practice, which includes options to integrate random sampling or entropy seeding with the support of quantum circuits for an improved randomness profile in comparison to pseudo random number generators. Benchmarking shows that neural networks may demonstrate an improved performance when a known noise profile is mitigated with corresponding injections to both training and inference, and that gradient boosting appears to be robust to a mild noise profile in inference, suggesting that stochastic perturbations could be integrated into existing data pipelines for prior trained gradient boosting models.
    Quantifying the Effects of Data Augmentation. (arXiv:2202.09134v1 [cs.LG])
    We provide results that exactly quantify how data augmentation affects the convergence rate and variance of estimates. They lead to some unexpected findings: Contrary to common intuition, data augmentation may increase rather than decrease uncertainty of estimates, such as the empirical prediction risk. Our main theoretical tool is a limit theorem for functions of randomly transformed, high-dimensional random vectors. The proof draws on work in probability on noise stability of functions of many variables. The pathological behavior we identify is not a consequence of complex models, but can occur even in the simplest settings -- one of our examples is a linear ridge regressor with two parameters. On the other hand, our results also show that data augmentation can have real, quantifiable benefits.
    On Variance Estimation of Random Forests. (arXiv:2202.09008v1 [stat.ML])
    Ensemble methods based on subsampling, such as random forests, are popular in applications due to their high predictive accuracy. Existing literature views a random forest prediction as an infinite-order incomplete U-statistic to quantify its uncertainty. However, these methods focus on a small subsampling size of each tree, which is theoretically valid but practically limited. This paper develops an unbiased variance estimator based on incomplete U-statistics, which allows the tree size to be comparable with the overall sample size, making statistical inference possible in a broader range of real applications. Simulation results demonstrate that our estimators enjoy lower bias and more accurate confidence interval coverage without additional computational costs. We also propose a local smoothing procedure to reduce the variation of our estimator, which shows improved numerical performance when the number of trees is relatively small. Further, we investigate the ratio consistency of our proposed variance estimator under specific scenarios. In particular, we develop a new "double U-statistic" formulation to analyze the Hoeffding decomposition of the estimator's variance.
    Training neural networks using monotone variational inequality. (arXiv:2202.08876v1 [stat.ML])
    Despite the vast empirical success of neural networks, theoretical understanding of the training procedures remains limited, especially in providing performance guarantees of testing performance due to the non-convex nature of the optimization problem. Inspired by a recent work of (Juditsky & Nemirovsky, 2019), instead of using the traditional loss function minimization approach, we reduce the training of the network parameters to another problem with convex structure -- to solve a monotone variational inequality (MVI). The solution to MVI can be found by computationally efficient procedures, and importantly, this leads to performance guarantee of $\ell_2$ and $\ell_{\infty}$ bounds on model recovery accuracy and prediction accuracy under the theoretical setting of training one-layer linear neural network. In addition, we study the use of MVI for training multi-layer neural networks and propose a practical algorithm called \textit{stochastic variational inequality} (SVI), and demonstrates its applicability in training fully-connected neural networks and graph neural networks (GNN) (SVI is completely general and can be used to train other types of neural networks). We demonstrate the competitive or better performance of SVI compared to the stochastic gradient descent (SGD) on both synthetic and real network data prediction tasks regarding various performance metrics.
    A Summary of the ComParE COVID-19 Challenges. (arXiv:2202.08981v1 [cs.SD])
    The COVID-19 pandemic has caused massive humanitarian and economic damage. Teams of scientists from a broad range of disciplines have searched for methods to help governments and communities combat the disease. One avenue from the machine learning field which has been explored is the prospect of a digital mass test which can detect COVID-19 from infected individuals' respiratory sounds. We present a summary of the results from the INTERSPEECH 2021 Computational Paralinguistics Challenges: COVID-19 Cough, (CCS) and COVID-19 Speech, (CSS).
    Proximal Optimal Transport Modeling of Population Dynamics. (arXiv:2106.06345v4 [cs.LG] UPDATED)
    We propose a new approach to model the collective dynamics of a population of particles evolving with time. As is often the case in challenging scientific applications, notably single-cell genomics, measuring features for these particles requires destroying them. As a result, the population can only be monitored with periodic snapshots, obtained by sampling a few particles that are sacrificed in exchange for measurements. Given only access to these snapshots, can we reconstruct likely individual trajectories for all other particles? We propose to model these trajectories as collective realizations of a causal Jordan-Kinderlehrer-Otto (JKO) flow of measures: The JKO scheme posits that the new configuration taken by a population at time $t+1$ is one that trades off an improvement, in the sense that it decreases an energy, while remaining close (in Wasserstein distance) to the previous configuration observed at $t$. In order to learn such an energy using only snapshots, we propose JKOnet, a neural architecture that computes (in end-to-end differentiable fashion) the JKO flow given a parametric energy and initial configuration of points. We demonstrate the good performance and robustness of the JKOnet fitting procedure, compared to a more direct forward method.
    Neo: Generalizing Confusion Matrix Visualization to Hierarchical and Multi-Output Labels. (arXiv:2110.12536v2 [cs.HC] UPDATED)
    The confusion matrix, a ubiquitous visualization for helping people evaluate machine learning models, is a tabular layout that compares predicted class labels against actual class labels over all data instances. We conduct formative research with machine learning practitioners at Apple and find that conventional confusion matrices do not support more complex data-structures found in modern-day applications, such as hierarchical and multi-output labels. To express such variations of confusion matrices, we design an algebra that models confusion matrices as probability distributions. Based on this algebra, we develop Neo, a visual analytics system that enables practitioners to flexibly author and interact with hierarchical and multi-output confusion matrices, visualize derived metrics, renormalize confusions, and share matrix specifications. Finally, we demonstrate Neo's utility with three model evaluation scenarios that help people better understand model performance and reveal hidden confusions.
    Deep Reinforcement Learning Based Multi-Access Edge Computing Schedule for Internet of Vehicle. (arXiv:2202.08972v1 [cs.LG])
    As intelligent transportation systems been implemented broadly and unmanned arial vehicles (UAVs) can assist terrestrial base stations acting as multi-access edge computing (MEC) to provide a better wireless network communication for Internet of Vehicles (IoVs), we propose a UAVs-assisted approach to help provide a better wireless network service retaining the maximum Quality of Experience(QoE) of the IoVs on the lane. In the paper, we present a Multi-Agent Graph Convolutional Deep Reinforcement Learning (M-AGCDRL) algorithm which combines local observations of each agent with a low-resolution global map as input to learn a policy for each agent. The agents can share their information with others in graph attention networks, resulting in an effective joint policy. Simulation results show that the M-AGCDRL method enables a better QoE of IoTs and achieves good performance.
    Strategyproof Learning: Building Trustworthy User-Generated Datasets. (arXiv:2106.02398v2 [cs.LG] UPDATED)
    We prove in this paper that, perhaps surprisingly, incentivizing data misreporting is not a fatality. By leveraging a careful design of the loss function, we propose Licchavi, a global and personalized learning framework with provable strategyproofness guarantees. Essentially, we prove that no user can gain much by replying to Licchavi's queries with answers that deviate from their true preferences. Interestingly, Licchavi also promotes the desirable "one person, one unit-force vote" fairness principle. Furthermore, our empirical evaluation of its performance showcases Licchavi's real-world applicability. We believe that our results are critical for the safety of any learning scheme that leverages user-generated data.
    Case-based Reasoning for Better Generalization in Text-Adventure Games. (arXiv:2110.08470v2 [cs.CL] UPDATED)
    Text-based games (TBG) have emerged as promising environments for driving research in grounded language understanding and studying problems like generalization and sample efficiency. Several deep reinforcement learning (RL) methods with varying architectures and learning schemes have been proposed for TBGs. However, these methods fail to generalize efficiently, especially under distributional shifts. In a departure from deep RL approaches, in this paper, we propose a general method inspired by case-based reasoning to train agents and generalize out of the training distribution. The case-based reasoner collects instances of positive experiences from the agent's interaction with the world in the past and later reuses the collected experiences to act efficiently. The method can be applied in conjunction with any existing on-policy neural agent in the literature for TBGs. Our experiments show that the proposed approach consistently improves existing methods, obtains good out-of-distribution generalization, and achieves new state-of-the-art results on widely used environments.
    A Theoretical Perspective on Hyperdimensional Computing. (arXiv:2010.07426v3 [cs.LG] UPDATED)
    Hyperdimensional (HD) computing is a set of neurally inspired methods for obtaining high-dimensional, low-precision, distributed representations of data. These representations can be combined with simple, neurally plausible algorithms to effect a variety of information processing tasks. HD computing has recently garnered significant interest from the computer hardware community as an energy-efficient, low-latency, and noise-robust tool for solving learning problems. In this review, we present a unified treatment of the theoretical foundations of HD computing with a focus on the suitability of representations for learning.
    Unsupervised Multiple-Object Tracking with a Dynamical Variational Autoencoder. (arXiv:2202.09315v1 [cs.LG])
    In this paper, we present an unsupervised probabilistic model and associated estimation algorithm for multi-object tracking (MOT) based on a dynamical variational autoencoder (DVAE), called DVAE-UMOT. The DVAE is a latent-variable deep generative model that can be seen as an extension of the variational autoencoder for the modeling of temporal sequences. It is included in DVAE-UMOT to model the objects' dynamics, after being pre-trained on an unlabeled synthetic dataset of single-object trajectories. Then the distributions and parameters of DVAE-UMOT are estimated on each multi-object sequence to track using the principles of variational inference: Definition of an approximate posterior distribution of the latent variables and maximization of the corresponding evidence lower bound of the data likehood function. DVAE-UMOT is shown experimentally to compete well with and even surpass the performance of two state-of-the-art probabilistic MOT models. Code and data are publicly available.
    Beyond Low-pass Filtering: Graph Convolutional Networks with Automatic Filtering. (arXiv:2107.04755v2 [cs.LG] UPDATED)
    Graph convolutional networks are becoming indispensable for deep learning from graph-structured data. Most of the existing graph convolutional networks share two big shortcomings. First, they are essentially low-pass filters, thus the potentially useful middle and high frequency band of graph signals are ignored. Second, the bandwidth of existing graph convolutional filters is fixed. Parameters of a graph convolutional filter only transform the graph inputs without changing the curvature of a graph convolutional filter function. In reality, we are uncertain about whether we should retain or cut off the frequency at a certain point unless we have expert domain knowledge. In this paper, we propose Automatic Graph Convolutional Networks (AutoGCN) to capture the full spectrum of graph signals and automatically update the bandwidth of graph convolutional filters. While it is based on graph spectral theory, our AutoGCN is also localized in space and has a spatial form. Experimental results show that AutoGCN achieves significant improvement over baseline methods which only work as low-pass filters.
    An Integrated Optimization and Machine Learning Models to Predict the Admission Status of Emergency Patients. (arXiv:2202.09196v1 [cs.LG])
    This work proposes a framework for optimizing machine learning algorithms. The practicality of the framework is illustrated using an important case study from the healthcare domain, which is predicting the admission status of emergency department (ED) patients (e.g., admitted vs. discharged) using patient data at the time of triage. The proposed framework can mitigate the crowding problem by proactively planning the patient boarding process. A large retrospective dataset of patient records is obtained from the electronic health record database of all ED visits over three years from three major locations of a healthcare provider in the Midwest of the US. Three machine learning algorithms are proposed: T-XGB, T-ADAB, and T-MLP. T-XGB integrates extreme gradient boosting (XGB) and Tabu Search (TS), T-ADAB integrates Adaboost and TS, and T-MLP integrates multi-layer perceptron (MLP) and TS. The proposed algorithms are compared with the traditional algorithms: XGB, ADAB, and MLP, in which their parameters are tunned using grid search. The three proposed algorithms and the original ones are trained and tested using nine data groups that are obtained from different feature selection methods. In other words, 54 models are developed. Performance was evaluated using five measures: Area under the curve (AUC), sensitivity, specificity, F1, and accuracy. The results show that the newly proposed algorithms resulted in high AUC and outperformed the traditional algorithms. The T-ADAB performs the best among the newly developed algorithms. The AUC, sensitivity, specificity, F1, and accuracy of the best model are 95.4%, 99.3%, 91.4%, 95.2%, 97.2%, respectively.
    Audio inpainting of music by means of neural networks. (arXiv:1810.12138v3 [cs.SD] UPDATED)
    We studied the ability of deep neural networks (DNNs) to restore missing audio content based on its context, a process usually referred to as audio inpainting. We focused on gaps in the range of tens of milliseconds. The proposed DNN structure was trained on audio signals containing music and musical instruments, separately, with 64-ms long gaps. The input to the DNN was the context, i.e., the signal surrounding the gap, transformed into time-frequency (TF) coefficients. Our results were compared to those obtained from a reference method based on linear predictive coding (LPC). For music, our DNN significantly outperformed the reference method, demonstrating a generally good usability of the proposed DNN structure for inpainting complex audio signals like music.
    Molecule Generation for Drug Design: a Graph Learning Perspective. (arXiv:2202.09212v1 [cs.LG])
    Machine learning has revolutionized many fields, and graph learning is recently receiving increasing attention. From the application perspective, one of the emerging and attractive areas is aiding the design and discovery of molecules, especially in drug industry. In this survey, we provide an overview of the state-of-the-art molecule (and mostly for de novo drug) design and discovery aiding methods whose methodology involves (deep) graph learning. Specifically, we propose to categorize these methods into three groups: i) all at once, ii) fragment-based and iii) node-by-node. We further present some representative public datasets and summarize commonly utilized evaluation metrics for generation and optimization, respectively. Finally, we discuss challenges and directions for future research, from the drug design perspective.
    DataMUX: Data Multiplexing for Neural Networks. (arXiv:2202.09318v1 [cs.LG])
    In this paper, we introduce data multiplexing (DataMUX), a technique that enables deep neural networks to process multiple inputs simultaneously using a single compact representation. DataMUX demonstrates that neural networks are capable of generating accurate predictions over mixtures of inputs, resulting in increased throughput with minimal extra memory requirements. Our approach uses two key components -- 1) a multiplexing layer that performs a fixed linear transformation to each input before combining them to create a mixed representation of the same size as a single input, which is then processed by the base network, and 2) a demultiplexing layer that converts the base network's output back into independent representations before producing predictions for each input. We show the viability of DataMUX for different architectures (Transformers, and to a lesser extent MLPs and CNNs) across six different tasks spanning sentence classification, named entity recognition and image classification. For instance, DataMUX for Transformers can multiplex up to $20$x/$40$x inputs, achieving $11$x/$18$x increase in throughput with minimal absolute performance drops of $<2\%$ and $<4\%$ respectively on MNLI, a natural language inference task. We also provide a theoretical construction for multiplexing in self-attention networks and analyze the effect of various design elements in DataMUX.
    Efficient computation of the volume of a polytope in high-dimensions using Piecewise Deterministic Markov Processes. (arXiv:2202.09129v1 [stat.CO])
    Computing the volume of a polytope in high dimensions is computationally challenging but has wide applications. Current state-of-the-art algorithms to compute such volumes rely on efficient sampling of a Gaussian distribution restricted to the polytope, using e.g. Hamiltonian Monte Carlo. We present a new sampling strategy that uses a Piecewise Deterministic Markov Process. Like Hamiltonian Monte Carlo, this new method involves simulating trajectories of a non-reversible process and inherits similar good mixing properties. However, importantly, the process can be simulated more easily due to its piecewise linear trajectories - and this leads to a reduction of the computational cost by a factor of the dimension of the space. Our experiments indicate that our method is numerically robust and is one order of magnitude faster (or better) than existing methods using Hamiltonian Monte Carlo. On a single core processor, we report computational time of a few minutes up to dimension 500.
    Machine learning models and facial regions videos for estimating heart rate: a review on Patents, Datasets and Literature. (arXiv:2202.08913v1 [cs.LG])
    Estimating heart rate is important for monitoring users in various situations. Estimates based on facial videos are increasingly being researched because it makes it possible to monitor cardiac information in a non-invasive way and because the devices are simpler, requiring only cameras that capture the user's face. From these videos of the user's face, machine learning is able to estimate heart rate. This study investigates the benefits and challenges of using machine learning models to estimate heart rate from facial videos, through patents, datasets, and articles review. We searched Derwent Innovation, IEEE Xplore, Scopus, and Web of Science knowledge bases and identified 7 patent filings, 11 datasets, and 20 articles on heart rate, photoplethysmography, or electrocardiogram data. In terms of patents, we note the advantages of inventions related to heart rate estimation, as described by the authors. In terms of datasets, we discovered that most of them are for academic purposes and with different signs and annotations that allow coverage for subjects other than heartbeat estimation. In terms of articles, we discovered techniques, such as extracting regions of interest for heart rate reading and using Video Magnification for small motion extraction, and models such as EVM-CNN and VGG-16, that extract the observed individual's heart rate, the best regions of interest for signal extraction and ways to process them.
    Masked prediction tasks: a parameter identifiability view. (arXiv:2202.09305v1 [cs.LG])
    The vast majority of work in self-supervised learning, both theoretical and empirical (though mostly the latter), have largely focused on recovering good features for downstream tasks, with the definition of "good" often being intricately tied to the downstream task itself. This lens is undoubtedly very interesting, but suffers from the problem that there isn't a "canonical" set of downstream tasks to focus on -- in practice, this problem is usually resolved by competing on the benchmark dataset du jour. In this paper, we present an alternative lens: one of parameter identifiability. More precisely, we consider data coming from a parametric probabilistic model, and train a self-supervised learning predictor with a suitably chosen parametric form. Then, we ask whether we can read off the ground truth parameters of the probabilistic model from the optimal predictor. We focus on the widely used self-supervised learning method of predicting masked tokens, which is popular for both natural languages and visual data. While incarnations of this approach have already been successfully used for simpler probabilistic models (e.g. learning fully-observed undirected graphical models), we focus instead on latent-variable models capturing sequential structures -- namely Hidden Markov Models with both discrete and conditionally Gaussian observations. We show that there is a rich landscape of possibilities, out of which some prediction tasks yield identifiability, while others do not. Our results, borne of a theoretical grounding of self-supervised learning, could thus potentially beneficially inform practice. Moreover, we uncover close connections with uniqueness of tensor rank decompositions -- a widely used tool in studying identifiability through the lens of the method of moments.
    Geometric Regularization from Overparameterization explains Double Descent and other findings. (arXiv:2202.09276v1 [cs.LG])
    The volume of the distribution of possible weight configurations associated with a loss value may be the source of implicit regularization from overparameterization due to the phenomenon of contracting volume with increasing dimensions for geometric figures demonstrated by hyperspheres. This paper introduces geometric regularization and explores potential applicability to several unexplained phenomenon including double descent, the differences between wide and deep networks, the benefits of He initialization and retained proximity in training, gradient confusion, fitness landscape properties, double descent in other learning paradigms, and other findings for overparameterized learning. Experiments are conducted by aggregating histograms of loss values corresponding to randomly sampled initializations in small setups, which find directional correlations in zero or central mode dominance from deviations in width, depth, and initialization distributions. Double descent is likely due to a regularization phase change when a training path reaches low enough loss that the loss manifold volume contraction from a reduced range of potential weight sets is amplified by an overparameterized geometry.
    Learning Predictions for Algorithms with Predictions. (arXiv:2202.09312v1 [cs.LG])
    A burgeoning paradigm in algorithm design is the field of algorithms with predictions, in which algorithms are designed to take advantage of a possibly-imperfect prediction of some aspect of the problem. While much work has focused on using predictions to improve competitive ratios, running times, or other performance measures, less effort has been devoted to the question of how to obtain the predictions themselves, especially in the critical online setting. We introduce a general design approach for algorithms that learn predictors: (1) identify a functional dependence of the performance measure on the prediction quality, and (2) apply techniques from online learning to learn predictors against adversarial instances, tune robustness-consistency trade-offs, and obtain new statistical guarantees. We demonstrate the effectiveness of our approach at deriving learning algorithms by analyzing methods for bipartite matching, page migration, ski-rental, and job scheduling. In the first and last settings we improve upon existing learning-theoretic results by deriving online results, obtaining better or more general statistical guarantees, and utilizing a much simpler analysis, while in the second and fourth we provide the first learning-theoretic guarantees.  ( 2 min )
    Surf or sleep? Understanding the influence of bedtime patterns on campus. (arXiv:2202.09283v1 [cs.AI])
    Poor sleep habits may cause serious problems of mind and body, and it is a commonly observed issue for college students due to study workload as well as peer and social influence. Understanding its impact and identifying students with poor sleep habits matters a lot in educational management. Most of the current research is either based on self-reports and questionnaires, suffering from a small sample size and social desirability bias, or the methods used are not suitable for the education system. In this paper, we develop a general data-driven method for identifying students' sleep patterns according to their Internet access pattern stored in the education management system and explore its influence from various aspects. First, we design a Possion-based probabilistic mixture model to cluster students according to the distribution of bedtime and identify students who are used to staying up late. Second, we profile students from five aspects (including eight dimensions) based on campus-behavior data and build Bayesian networks to explore the relationship between behavioral characteristics and sleeping habits. Finally, we test the predictability of sleeping habits. This paper not only contributes to the understanding of student sleep from a cognitive and behavioral perspective but also presents a new approach that provides an effective framework for various educational institutions to detect the sleeping patterns of students.  ( 2 min )
    ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!. (arXiv:2202.09357v1 [cs.LG])
    We introduce \algname{ProxSkip} -- a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth proximable ($\psi$) function. The canonical approach to solving such problems is via the proximal gradient descent (\algname{ProxGD}) algorithm, which is based on the evaluation of the gradient of $f$ and the prox operator of $\psi$ in each iteration. In this work we are specifically interested in the regime in which the evaluation of prox is costly relative to the evaluation of the gradient, which is the case in many applications. \algname{ProxSkip} allows for the expensive prox operator to be skipped in most iterations: while its iteration complexity is $\cO(\kappa \log \nicefrac{1}{\varepsilon})$, where $\kappa$ is the condition number of $f$, the number of prox evaluations is $\cO(\sqrt{\kappa} \log \nicefrac{1}{\varepsilon})$ only. Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local \algname{GD} step independently on all devices, and evaluation of prox corresponds to (expensive) communication in the form of gradient averaging. In this context, \algname{ProxSkip} offers an effective {\em acceleration} of communication complexity. Unlike other local gradient-type methods, such as \algname{FedAvg}, \algname{SCAFFOLD}, \algname{S-Local-GD} and \algname{FedLin}, whose theoretical communication complexity is worse than, or at best matching, that of vanilla \algname{GD} in the heterogeneous data regime, we obtain a provable and large improvement without any heterogeneity-bounding assumptions.  ( 2 min )
    (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering. (arXiv:2202.09277v1 [cs.CV])
    Spatio-temporal scene-graph approaches to video-based reasoning tasks such as video question-answering (QA) typically construct such graphs for every video frame. Such approaches often ignore the fact that videos are essentially sequences of 2D "views" of events happening in a 3D space, and that the semantics of the 3D scene can thus be carried over from frame to frame. Leveraging this insight, we propose a (2.5+1)D scene graph representation to better capture the spatio-temporal information flows inside the videos. Specifically, we first create a 2.5D (pseudo-3D) scene graph by transforming every 2D frame to have an inferred 3D structure using an off-the-shelf 2D-to-3D transformation module, following which we register the video frames into a shared (2.5+1)D spatio-temporal space and ground each 2D scene graph within it. Such a (2.5+1)D graph is then segregated into a static sub-graph and a dynamic sub-graph, corresponding to whether the objects within them usually move in the world. The nodes in the dynamic graph are enriched with motion features capturing their interactions with other graph nodes. Next, for the video QA task, we present a novel transformer-based reasoning pipeline that embeds the (2.5+1)D graph into a spatio-temporal hierarchical latent space, where the sub-graphs and their interactions are captured at varied granularity. To demonstrate the effectiveness of our approach, we present experiments on the NExT-QA and AVSD-QA datasets. Our results show that our proposed (2.5+1)D representation leads to faster training and inference, while our hierarchical model showcases superior performance on the video QA task versus the state of the art.  ( 2 min )
    Human experts vs. machines in taxa recognition. (arXiv:1708.06899v5 [stat.ML] UPDATED)
    The step of expert taxa recognition currently slows down the response time of many bioassessments. Shifting to quicker and cheaper state-of-the-art machine learning approaches is still met with expert scepticism towards the ability and logic of machines. In our study, we investigate both the differences in accuracy and in the identification logic of taxonomic experts and machines. We propose a systematic approach utilizing deep Convolutional Neural Nets with the transfer learning paradigm and extensively evaluate it over a multi-pose taxonomic dataset with hierarchical labels specifically created for this comparison. We also study the prediction accuracy on different ranks of taxonomic hierarchy in detail. We used support vector machine classifier as a benchmark. Our results revealed that human experts using actual specimens yield the lowest classification error ($\overline{CE}=6.1\%$). However, a much faster, automated approach using deep Convolutional Neural Nets comes close to human accuracy ($\overline{CE}=11.4\%$) when a typical flat classification approach is used. Contrary to previous findings in the literature, we find that for machines following a typical flat classification approach commonly used in machine learning performs better than forcing machines to adopt a hierarchical, local per parent node approach used by human taxonomic experts ($\overline{CE}=13.8\%$). Finally, we publicly share our unique dataset to serve as a public benchmark dataset in this field.  ( 2 min )
    Rethinking Pareto Frontier for Performance Evaluation of Deep Neural Networks. (arXiv:2202.09275v1 [cs.LG])
    Recent efforts in deep learning show a considerable advancement in redesigning deep learning models for low-resource and edge devices. The performance optimization of deep learning models are conducted either manually or through automatic architecture search, or a combination of both. The throughput and power consumption of deep learning models strongly depend on the target hardware. We propose to use a \emph{multi-dimensional} Pareto frontier to re-define the efficiency measure using a multi-objective optimization, where other variables such as power consumption, latency, and accuracy play a relative role in defining a dominant model. Furthermore, a random version of the multi-dimensional Pareto frontier is introduced to mitigate the uncertainty of accuracy, latency, and throughput variations of deep learning models in different experimental setups. These two breakthroughs provide an objective benchmarking method for a wide range of deep learning models. We run our novel multi-dimensional stochastic relative efficiency on a wide range of deep image classification models trained ImageNet data. Thank to this new approach we combine competing variables with stochastic nature simultaneously in a single relative efficiency measure. This allows to rank deep models that run efficiently on different computing hardware, and combines inference efficiency with training efficiency objectively.  ( 2 min )
    Learning Physics-Informed Neural Networks without Stacked Back-propagation. (arXiv:2202.09340v1 [cs.LG])
    Physics-Informed Neural Network (PINN) has become a commonly used machine learning approach to solve partial differential equations (PDE). But, facing high-dimensional second-order PDE problems, PINN will suffer from severe scalability issues since its loss includes second-order derivatives, the computational cost of which will grow along with the dimension during stacked back-propagation. In this paper, we develop a novel approach that can significantly accelerate the training of Physics-Informed Neural Networks. In particular, we parameterize the PDE solution by the Gaussian smoothed model and show that, derived from Stein's Identity, the second-order derivatives can be efficiently calculated without back-propagation. We further discuss the model capacity and provide variance reduction methods to address key limitations in the derivative estimation. Experimental results show that our proposed method can achieve competitive error compared to standard PINN training but is two orders of magnitude faster.
    Universal Prediction Band via Semi-Definite Programming. (arXiv:2103.17203v2 [stat.ML] UPDATED)
    We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.  ( 2 min )
    Analogical Proportions. (arXiv:2006.02854v11 [cs.LO] UPDATED)
    Analogy-making is at the core of human and artificial intelligence and creativity with applications to such diverse tasks as proving mathematical theorems and building mathematical theories, common sense reasoning, learning, language acquisition, and story telling. This paper introduces from first principles an abstract algebraic framework of analogical proportions of the form `$a$ is to $b$ what $c$ is to $d$' in the general setting of universal algebra. This enables us to compare mathematical objects possibly across different domains in a uniform way which is crucial for AI-systems. It turns out that our notion of analogical proportions has appealing mathematical properties. As we construct our model from first principles using only elementary concepts of universal algebra, and since our model questions some basic properties of analogical proportions presupposed in the literature, to convince the reader of the plausibility of our model we show that it can be naturally embedded into first-order logic via model-theoretic types and prove from that perspective that analogical proportions are compatible with structure-preserving mappings. This provides conceptual evidence for its applicability. In a broader sense, this paper is a first step towards a theory of analogical reasoning and learning systems with potential applications to fundamental AI-problems like common sense reasoning and computational learning and creativity.  ( 2 min )
    What Doesn't Kill You Makes You Robust(er): How to Adversarially Train against Data Poisoning. (arXiv:2102.13624v2 [cs.LG] UPDATED)
    Data poisoning is a threat model in which a malicious actor tampers with training data to manipulate outcomes at inference time. A variety of defenses against this threat model have been proposed, but each suffers from at least one of the following flaws: they are easily overcome by adaptive attacks, they severely reduce testing performance, or they cannot generalize to diverse data poisoning threat models. Adversarial training, and its variants, are currently considered the only empirically strong defense against (inference-time) adversarial attacks. In this work, we extend the adversarial training framework to defend against (training-time) data poisoning, including targeted and backdoor attacks. Our method desensitizes networks to the effects of such attacks by creating poisons during training and injecting them into training batches. We show that this defense withstands adaptive attacks, generalizes to diverse threat models, and incurs a better performance trade-off than previous defenses such as DP-SGD or (evasion) adversarial training.  ( 2 min )
    Quantification of Actual Road User Behavior on the Basis of Given Traffic Rules. (arXiv:2202.09269v1 [cs.RO])
    Driving on roads is restricted by various traffic rules, aiming to ensure safety for all traffic participants. However, human road users usually do not adhere to these rules strictly, resulting in varying degrees of rule conformity. Such deviations from given rules are key components of today's road traffic. In autonomous driving, robotic agents can disturb traffic flow, when rule deviations are not taken into account. In this paper, we present an approach to derive the distribution of degrees of rule conformity from human driving data. We demonstrate our method with the Waymo Open Motion dataset and Safety Distance and Speed Limit rules.  ( 2 min )
    Towards a Numerical Proof of Turbulence Closure. (arXiv:2202.09289v1 [physics.flu-dyn])
    The development of turbulence closure models, parametrizing the influence of small non-resolved scales on the dynamics of large resolved ones, is an outstanding theoretical challenge with vast applicative relevance. We present a closure, based on deep recurrent neural networks, that quantitatively reproduces, within statistical errors, Eulerian and Lagrangian structure functions and the intermittent statistics of the energy cascade, including those of subgrid fluxes. To achieve high-order statistical accuracy, and thus a stringent statistical test, we employ shell models of turbulence. Our results encourage the development of similar approaches for 3D Navier-Stokes turbulence.  ( 2 min )
    FinNet: Solving Time-Independent Differential Equations with Finite Difference Neural Network. (arXiv:2202.09282v1 [cs.LG])
    In recent years, deep learning approaches for partial differential equations have received much attention due to their mesh-freeness and other desirable properties. However, most of the works so far concentrated on time-dependent nonlinear differential equations. In this work, we analyze potential issues with the well-known Physic Informed Neural Network for differential equations that are not time-dependent. This analysis motivates us to introduce a novel technique, namely FinNet, for solving differential equations by incorporating finite difference into deep learning. Even though we use a mesh during the training phase, the prediction phase is mesh-free. We illustrate the effectiveness of our method through experiments on solving various equations.
    Curriculum optimization for low-resource speech recognition. (arXiv:2202.08883v1 [eess.AS])
    Modern end-to-end speech recognition models show astonishing results in transcribing audio signals into written text. However, conventional data feeding pipelines may be sub-optimal for low-resource speech recognition, which still remains a challenging task. We propose an automated curriculum learning approach to optimize the sequence of training examples based on both the progress of the model while training and prior knowledge about the difficulty of the training examples. We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions. The proposed method improves speech recognition Word Error Rate performance by up to 33% relative over the baseline system  ( 2 min )
    Designing Effective Sparse Expert Models. (arXiv:2202.08906v1 [cs.CL])
    Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).  ( 2 min )
    Understanding and Shifting Preferences for Battery Electric Vehicles. (arXiv:2202.08963v1 [cs.IR])
    Identifying personalized interventions for an individual is an important task. Recent work has shown that interventions that do not consider the demographic background of individual consumers can, in fact, produce the reverse effect, strengthening opposition to electric vehicles. In this work, we focus on methods for personalizing interventions based on an individual's demographics to shift the preferences of consumers to be more positive towards Battery Electric Vehicles (BEVs). One of the constraints in building models to suggest interventions for shifting preferences is that each intervention can influence the effectiveness of later interventions. This, in turn, requires many subjects to evaluate effectiveness of each possible intervention. To address this, we propose to identify personalized factors influencing BEV adoption, such as barriers and motivators. We present a method for predicting these factors and show that the performance is better than always predicting the most frequent factors. We then present a Reinforcement Learning (RL) model that learns the most effective interventions, and compare the number of subjects required for each approach.  ( 2 min )
    Adaptivity and Confounding in Multi-Armed Bandit Experiments. (arXiv:2202.09036v1 [cs.LG])
    Multi-armed bandit algorithms minimize experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding, which is the primary concern underlying classical randomized controlled trials. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when day-of-week effects may confound inferences. In response, this paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action -- suggesting optimal adaptivity -- but also satisfies strong performance guarantees in the presence of day-of-week effects and delayed observations -- suggesting unusual robustness. At the core of the paper is a new model of contextual bandit experiments in which issues of delayed learning and distribution shift arise organically.  ( 2 min )
    Probing Pretrained Models of Source Code. (arXiv:2202.08975v1 [cs.SE])
    Deep learning models are widely used for solving challenging code processing tasks, such as code generation or code summarization. Traditionally, a specific model architecture was carefully built to solve a particular code processing task. However, recently general pretrained models such as CodeBERT or CodeT5 have been shown to outperform task-specific models in many applications. While pretrained models are known to learn complex patterns from data, they may fail to understand some properties of source code. To test diverse aspects of code understanding, we introduce a set of diagnosting probing tasks. We show that pretrained models of code indeed contain information about code syntactic structure and correctness, the notions of identifiers, data flow and namespaces, and natural language naming. We also investigate how probing results are affected by using code-specific pretraining objectives, varying the model size, or finetuning.  ( 2 min )
    Gaussian Mixture Convolution Networks. (arXiv:2202.09153v1 [cs.LG])
    This paper proposes a novel method for deep learning based on the analytical convolution of multidimensional Gaussian mixtures. In contrast to tensors, these do not suffer from the curse of dimensionality and allow for a compact representation, as data is only stored where details exist. Convolution kernels and data are Gaussian mixtures with unconstrained weights, positions, and covariance matrices. Similar to discrete convolutional networks, each convolution step produces several feature channels, represented by independent Gaussian mixtures. Since traditional transfer functions like ReLUs do not produce Gaussian mixtures, we propose using a fitting of these functions instead. This fitting step also acts as a pooling layer if the number of Gaussian components is reduced appropriately. We demonstrate that networks based on this architecture reach competitive accuracy on Gaussian mixtures fitted to the MNIST and ModelNet data sets.  ( 2 min )
    Fairness constraint in Structural Econometrics and Application to fair estimation using Instrumental Variables. (arXiv:2202.08977v1 [econ.EM])
    A supervised machine learning algorithm determines a model from a learning sample that will be used to predict new observations. To this end, it aggregates individual characteristics of the observations of the learning sample. But this information aggregation does not consider any potential selection on unobservables and any status-quo biases which may be contained in the training sample. The latter bias has raised concerns around the so-called \textit{fairness} of machine learning algorithms, especially towards disadvantaged groups. In this chapter, we review the issue of fairness in machine learning through the lenses of structural econometrics models in which the unknown index is the solution of a functional equation and issues of endogeneity are explicitly accounted for. We model fairness as a linear operator whose null space contains the set of strictly {\it fair} indexes. A {\it fair} solution is obtained by projecting the unconstrained index into the null space of this operator or by directly finding the closest solution of the functional equation into this null space. We also acknowledge that policymakers may incur a cost when moving away from the status quo. Achieving \textit{approximate fairness} is obtained by introducing a fairness penalty in the learning procedure and balancing more or less heavily the influence between the status quo and a full fair solution.  ( 2 min )
    Testing the boundaries: Normalizing Flows for higher dimensional data sets. (arXiv:2202.09188v1 [stat.ML])
    Normalizing Flows (NFs) are emerging as a powerful class of generative models, as they not only allow for efficient sampling, but also deliver, by construction, density estimation. They are of great potential usage in High Energy Physics (HEP), where complex high dimensional data and probability distributions are everyday's meal. However, in order to fully leverage the potential of NFs it is crucial to explore their robustness as data dimensionality increases. Thus, in this contribution, we discuss the performances of some of the most popular types of NFs on the market, on some toy data sets with increasing number of dimensions.  ( 2 min )
    Word Embeddings for Automatic Equalization in Audio Mixing. (arXiv:2202.08898v1 [cs.SD])
    In recent years, machine learning has been widely adopted to automate the audio mixing process. Automatic mixing systems have been applied to various audio effects such as gain-adjustment, stereo panning, equalization, and reverberation. These systems can be controlled through visual interfaces, providing audio examples, using knobs, and semantic descriptors. Using semantic descriptors or textual information to control these systems is an effective way for artists to communicate their creative goals. Furthermore, sometimes artists use non-technical words that may not be understood by the mixing system, or even a mixing engineer. In this paper, we explore the novel idea of using word embeddings to represent semantic descriptors. Word embeddings are generally obtained by training neural networks on large corpora of written text. These embeddings serve as the input layer of the neural network to create a translation from words to EQ settings. Using this technique, the machine learning model can also generate EQ settings for semantic descriptors that it has not seen before. We perform experiments to demonstrate the feasibility of this idea. In addition, we compare the EQ settings of humans with the predictions of the neural network to evaluate the quality of predictions. The results showed that the embedding layer enables the neural network to understand semantic descriptors. We observed that the models with embedding layers perform better those without embedding layers, but not as good as human labels.  ( 2 min )
    The Response Shift Paradigm to Quantify Human Trust in AI Recommendations. (arXiv:2202.08979v1 [cs.HC])
    Explainability, interpretability and how much they affect human trust in AI systems are ultimately problems of human cognition as much as machine learning, yet the effectiveness of AI recommendations and the trust afforded by end-users are typically not evaluated quantitatively. We developed and validated a general purpose Human-AI interaction paradigm which quantifies the impact of AI recommendations on human decisions. In our paradigm we confronted human users with quantitative prediction tasks: asking them for a first response, before confronting them with an AI's recommendations (and explanation), and then asking the human user to provide an updated final response. The difference between final and first responses constitutes the shift or sway in the human decision which we use as metric of the AI's recommendation impact on the human, representing the trust they place on the AI. We evaluated this paradigm on hundreds of users through Amazon Mechanical Turk using a multi-branched experiment confronting users with good/poor AI systems that had good, poor or no explainability. Our proof-of-principle paradigm allows one to quantitatively compare the rapidly growing set of XAI/IAI approaches in terms of their effect on the end-user and opens up the possibility of (machine) learning trust.  ( 2 min )
    Can Interpretable Reinforcement Learning Manage Assets Your Way?. (arXiv:2202.09064v1 [cs.LG])
    Personalisation of products and services is fast becoming the driver of success in banking and commerce. Machine learning holds the promise of gaining a deeper understanding of and tailoring to customers' needs and preferences. Whereas traditional solutions to financial decision problems frequently rely on model assumptions, reinforcement learning is able to exploit large amounts of data to improve customer modelling and decision-making in complex financial environments with fewer assumptions. Model explainability and interpretability present challenges from a regulatory perspective which demands transparency for acceptance; they also offer the opportunity for improved insight into and understanding of customers. Post-hoc approaches are typically used for explaining pretrained reinforcement learning models. Based on our previous modeling of customer spending behaviour, we adapt our recent reinforcement learning algorithm that intrinsically characterizes desirable behaviours and we transition to the problem of asset management. We train inherently interpretable reinforcement learning agents to give investment advice that is aligned with prototype financial personality traits which are combined to make a final recommendation. We observe that the trained agents' advice adheres to their intended characteristics, they learn the value of compound growth, and, without any explicit reference, the notion of risk as well as improved policy convergence.  ( 2 min )
    Sampling Approximately Low-Rank Ising Models: MCMC meets Variational Methods. (arXiv:2202.08907v1 [cs.DS])
    We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when *all* eigenvalues fit in an interval of length one; however, a single outlier can force the Glauber dynamics to mix torpidly. Our general result implies the first polynomial time sampling algorithms for low-rank Ising models such as Hopfield networks with a fixed number of patterns and Bayesian clustering models with low-dimensional contexts, and greatly improves the polynomial time sampling regime for the antiferromagnetic/ferromagnetic Ising model with inconsistent field on expander graphs. It also improves on previous approximation algorithm results based on the naive mean-field approximation in variational methods and statistical physics. Our approach is based on a new fusion of ideas from the MCMC and variational inference worlds. As part of our algorithm, we define a new nonconvex variational problem which allows us to sample from an exponential reweighting of a distribution by a negative definite quadratic form, and show how to make this procedure provably efficient using stochastic gradient descent. On top of this, we construct a new simulated tempering chain (on an extended state space arising from the Hubbard-Stratonovich transform) which overcomes the obstacle posed by large positive eigenvalues, and combine it with the SGD-based sampler to solve the full problem.  ( 2 min )
    Rethinking Machine Learning Robustness via its Link with the Out-of-Distribution Problem. (arXiv:2202.08944v1 [cs.LG])
    Despite multiple efforts made towards robust machine learning (ML) models, their vulnerability to adversarial examples remains a challenging problem that calls for rethinking the defense strategy. In this paper, we take a step back and investigate the causes behind ML models' susceptibility to adversarial examples. In particular, we focus on exploring the cause-effect link between adversarial examples and the out-of-distribution (OOD) problem. To that end, we propose an OOD generalization method that stands against both adversary-induced and natural distribution shifts. Through an OOD to in-distribution mapping intuition, our approach translates OOD inputs to the data distribution used to train and test the model. Through extensive experiments on three benchmark image datasets of different scales (MNIST, CIFAR10, and ImageNet) and by leveraging image-to-image translation methods, we confirm that the adversarial examples problem is a special case of the wider OOD generalization problem. Across all datasets, we show that our translation-based approach consistently improves robustness to OOD adversarial inputs and outperforms state-of-the-art defenses by a significant margin, while preserving the exact accuracy on benign (in-distribution) data. Furthermore, our method generalizes on naturally OOD inputs such as darker or sharper images  ( 2 min )
    PGCN: Progressive Graph Convolutional Networks for Spatial-Temporal Traffic Forecasting. (arXiv:2202.08982v1 [cs.LG])
    The complex spatial-temporal correlations in transportation networks make the traffic forecasting problem challenging. Since transportation system inherently possesses graph structures, much research efforts have been put with graph neural networks. Recently, constructing adaptive graphs to the data has shown promising results over the models relying on a single static graph structure. However, the graph adaptations are applied during the training phases, and do not reflect the data used during the testing phases. Such shortcomings can be problematic especially in traffic forecasting since the traffic data often suffers from the unexpected changes and irregularities in the time series. In this study, we propose a novel traffic forecasting framework called Progressive Graph Convolutional Network (PGCN). PGCN constructs a set of graphs by progressively adapting to input data during the training and the testing phases. Specifically, we implemented the model to construct progressive adjacency matrices by learning trend similarities among graph nodes. Then, the model is combined with the dilated causal convolution and gated activation unit to extract temporal features. With residual and skip connections, PGCN performs the traffic prediction. When applied to four real-world traffic datasets of diverse geometric nature, the proposed model achieves state-of-the-art performance with consistency in all datasets. We conclude that the ability of PGCN to progressively adapt to input data enables the model to generalize in different study sites with robustness.  ( 2 min )
    Modelling the semantics of text in complex document layouts using graph transformer networks. (arXiv:2202.09144v1 [cs.CL])
    Representing structured text from complex documents typically calls for different machine learning techniques, such as language models for paragraphs and convolutional neural networks (CNNs) for table extraction, which prohibits drawing links between text spans from different content types. In this article we propose a model that approximates the human reading pattern of a document and outputs a unique semantic representation for every text span irrespective of the content type they are found in. We base our architecture on a graph representation of the structured text, and we demonstrate that not only can we retrieve semantically similar information across documents but also that the embedding space we generate captures useful semantic information, similar to language models that work only on text sequences.  ( 2 min )
    Nonstationary multi-output Gaussian processes via harmonizable spectral mixtures. (arXiv:2202.09233v1 [stat.ML])
    Kernel design for Multi-output Gaussian Processes (MOGP) has received increased attention recently. In particular, the Multi-Output Spectral Mixture kernel (MOSM) arXiv:1709.01298 approach has been praised as a general model in the sense that it extends other approaches such as Linear Model of Corregionalization, Intrinsic Corregionalization Model and Cross-Spectral Mixture. MOSM relies on Cram\'er's theorem to parametrise the power spectral densities (PSD) as a Gaussian mixture, thus, having a structural restriction: by assuming the existence of a PSD, the method is only suited for multi-output stationary applications. We develop a nonstationary extension of MOSM by proposing the family of harmonizable kernels for MOGPs, a class of kernels that contains both stationary and a vast majority of non-stationary processes. A main contribution of the proposed harmonizable kernels is that they automatically identify a possible nonstationary behaviour meaning that practitioners do not need to choose between stationary or non-stationary kernels. The proposed method is first validated on synthetic data with the purpose of illustrating the key properties of our approach, and then compared to existing MOGP methods on two real-world settings from finance and electroencephalography.  ( 2 min )
    Incorporating Texture Information into Dimensionality Reduction for High-Dimensional Images. (arXiv:2202.09179v1 [cs.CV])
    High-dimensional imaging is becoming increasingly relevant in many fields from astronomy and cultural heritage to systems biology. Visual exploration of such high-dimensional data is commonly facilitated by dimensionality reduction. However, common dimensionality reduction methods do not include spatial information present in images, such as local texture features, into the construction of low-dimensional embeddings. Consequently, exploration of such data is typically split into a step focusing on the attribute space followed by a step focusing on spatial information, or vice versa. In this paper, we present a method for incorporating spatial neighborhood information into distance-based dimensionality reduction methods, such as t-Distributed Stochastic Neighbor Embedding (t-SNE). We achieve this by modifying the distance measure between high-dimensional attribute vectors associated with each pixel such that it takes the pixel's spatial neighborhood into account. Based on a classification of different methods for comparing image patches, we explore a number of different approaches. We compare these approaches from a theoretical and experimental point of view. Finally, we illustrate the value of the proposed methods by qualitative and quantitative evaluation on synthetic data and two real-world use cases.  ( 2 min )
    Symphony: Composing Interactive Interfaces for Machine Learning. (arXiv:2202.08946v1 [cs.HC])
    Interfaces for machine learning (ML), information and visualizations about models or data, can help practitioners build robust and responsible ML systems. Despite their benefits, recent studies of ML teams and our interviews with practitioners (n=9) showed that ML interfaces have limited adoption in practice. While existing ML interfaces are effective for specific tasks, they are not designed to be reused, explored, and shared by multiple stakeholders in cross-functional teams. To enable analysis and communication between different ML practitioners, we designed and implemented Symphony, a framework for composing interactive ML interfaces with task-specific, data-driven components that can be used across platforms such as computational notebooks and web dashboards. We developed Symphony through participatory design sessions with 10 teams (n=31), and discuss our findings from deploying Symphony to 3 production ML projects at Apple. Symphony helped ML practitioners discover previously unknown issues like data duplicates and blind spots in models while enabling them to share insights with other stakeholders.  ( 2 min )
    Deep-Learning Architectures for Multi-Pitch Estimation: Towards Reliable Evaluation. (arXiv:2202.09198v1 [cs.SD])
    Extracting pitch information from music recordings is a challenging but important problem in music signal processing. Frame-wise transcription or multi-pitch estimation aims for detecting the simultaneous activity of pitches in polyphonic music recordings and has recently seen major improvements thanks to deep-learning techniques, with a variety of proposed network architectures. In this paper, we realize different architectures based on CNNs, the U-net structure, and self-attention components. We propose several modifications to these architectures including self-attention modules for skip connections, recurrent layers to replace the self-attention, and a multi-task strategy with simultaneous prediction of the degree of polyphony. We compare variants of these architectures in different sizes for multi-pitch estimation, focusing on Western classical music beyond the piano-solo scenario using the MusicNet and Schubert Winterreise datasets. Our experiments indicate that most architectures yield competitive results and that larger model variants seem to be beneficial. However, we find that these results substantially depend on randomization effects and the particular choice of the training-test split, which questions the claim of superiority for particular architectures given only small improvements. We therefore investigate the influence of dataset splits in the presence of several movements of a work cycle (cross-version evaluation) and propose a best-practice splitting strategy for MusicNet, which weakens the influence of individual test tracks and suppresses overfitting to specific works and recording conditions. A final evaluation on a mixed dataset suggests that improvements on one specific dataset do not necessarily generalize to other scenarios, thus emphasizing the need for further high-quality multi-pitch datasets in order to reliably measure progress in music transcription tasks.  ( 2 min )
    System Safety and Artificial Intelligence. (arXiv:2202.09292v1 [eess.SY])
    This chapter formulates seven lessons for preventing harm in artificial intelligence (AI) systems based on insights from the field of system safety for software-based automation in safety-critical domains. New applications of AI across societal domains and public organizations and infrastructures come with new hazards, which lead to new forms of harm, both grave and pernicious. The text addresses the lack of consensus for diagnosing and eliminating new AI system hazards. For decades, the field of system safety has dealt with accidents and harm in safety-critical systems governed by varying degrees of software-based automation and decision-making. This field embraces the core assumption of systems and control that AI systems cannot be safeguarded by technical design choices on the model or algorithm alone, instead requiring an end-to-end hazard analysis and design frame that includes the context of use, impacted stakeholders and the formal and informal institutional environment in which the system operates. Safety and other values are then inherently socio-technical and emergent system properties that require design and control measures to instantiate these across the technical, social and institutional components of a system. This chapter honors system safety pioneer Nancy Leveson, by situating her core lessons for today's AI system safety challenges. For every lesson, concrete tools are offered for rethinking and reorganizing the safety management of AI systems, both in design and governance. This history tells us that effective AI safety management requires transdisciplinary approaches and a shared language that allows involvement of all levels of society.  ( 2 min )
    Handling Imbalanced Datasets Through Optimum-Path Forest. (arXiv:2202.08934v1 [cs.LG])
    In the last decade, machine learning-based approaches became capable of performing a wide range of complex tasks sometimes better than humans, demanding a fraction of the time. Such an advance is partially due to the exponential growth in the amount of data available, which makes it possible to extract trustworthy real-world information from them. However, such data is generally imbalanced since some phenomena are more likely than others. Such a behavior yields considerable influence on the machine learning model's performance since it becomes biased on the more frequent data it receives. Despite the considerable amount of machine learning methods, a graph-based approach has attracted considerable notoriety due to the outstanding performance over many applications, i.e., the Optimum-Path Forest (OPF). In this paper, we propose three OPF-based strategies to deal with the imbalance problem: the $\text{O}^2$PF and the OPF-US, which are novel approaches for oversampling and undersampling, respectively, as well as a hybrid strategy combining both approaches. The paper also introduces a set of variants concerning the strategies mentioned above. Results compared against several state-of-the-art techniques over public and private datasets confirm the robustness of the proposed approaches.  ( 2 min )
    Soft Actor-Critic Deep Reinforcement Learning for Fault Tolerant Flight Control. (arXiv:2202.09262v1 [cs.SY])
    Fault-tolerant flight control faces challenges, as developing a model-based controller for each unexpected failure is unrealistic, and online learning methods can handle limited system complexity due to their low sample efficiency. In this research, a model-free coupled-dynamics flight controller for a jet aircraft able to withstand multiple failure types is proposed. An offline trained cascaded Soft Actor-Critic Deep Reinforcement Learning controller is successful on highly coupled maneuvers, including a coordinated 40 degree bank climbing turn with a normalized Mean Absolute Error of 2.64%. The controller is robust to six failure cases, including the rudder jammed at -15 deg, the aileron effectiveness reduced by 70%, a structural failure, icing and a backward c.g. shift as the response is stable and the climbing turn is completed successfully. Robustness to biased sensor noise, atmospheric disturbances, and to varying initial flight conditions and reference signal shapes is also demonstrated.  ( 2 min )
    Deep Interest Highlight Network for Click-Through RatePrediction in Trigger-Induced Recommendation. (arXiv:2202.08959v1 [cs.IR])
    In many classical e-commerce platforms, personalized recommendation has been proven to be of great business value, which can improve user satisfaction and increase the revenue of platforms. In this paper, we present a new recommendation problem, Trigger-Induced Recommendation (TIR), where users' instant interest can be explicitly induced with a trigger item and follow-up related target items are recommended accordingly. TIR has become ubiquitous and popular in e-commerce platforms. In this paper, we figure out that although existing recommendation models are effective in traditional recommendation scenarios by mining users' interests based on their massive historical behaviors, they are struggling in discovering users' instant interests in the TIR scenario due to the discrepancy between these scenarios, resulting in inferior performance. To tackle the problem, we propose a novel recommendation method named Deep Interest Highlight Network (DIHN) for Click-Through Rate (CTR) prediction in TIR scenarios. It has three main components including 1) User Intent Network (UIN), which responds to generate a precise probability score to predict user's intent on the trigger item; 2) Fusion Embedding Module (FEM), which adaptively fuses trigger item and target item embeddings based on the prediction from UIN; and (3) Hybrid Interest Extracting Module (HIEM), which can effectively highlight users' instant interest from their behaviors based on the result of FEM. Extensive offline and online evaluations on a real-world e-commerce platform demonstrate the superiority of DIHN over state-of-the-art methods.  ( 2 min )
    BADDr: Bayes-Adaptive Deep Dropout RL for POMDPs. (arXiv:2202.08884v1 [cs.LG])
    While reinforcement learning (RL) has made great advances in scalability, exploration and partial observability are still active research topics. In contrast, Bayesian RL (BRL) provides a principled answer to both state estimation and the exploration-exploitation trade-off, but struggles to scale. To tackle this challenge, BRL frameworks with various prior assumptions have been proposed, with varied success. This work presents a representation-agnostic formulation of BRL under partially observability, unifying the previous models under one theoretical umbrella. To demonstrate its practical significance we also propose a novel derivation, Bayes-Adaptive Deep Dropout rl (BADDr), based on dropout networks. Under this parameterization, in contrast to previous work, the belief over the state and dynamics is a more scalable inference problem. We choose actions through Monte-Carlo tree search and empirically show that our method is competitive with state-of-the-art BRL methods on small domains while being able to solve much larger ones.  ( 2 min )
    A Distributed Algorithm for Measure-valued Optimization with Additive Objective. (arXiv:2202.08930v1 [math.OC])
    We propose a distributed nonparametric algorithm for solving measure-valued optimization problems with additive objectives. Such problems arise in several contexts in stochastic learning and control including Langevin sampling from an unnormalized prior, mean field neural network learning and Wasserstein gradient flows. The proposed algorithm comprises a two-layer alternating direction method of multipliers (ADMM). The outer-layer ADMM generalizes the Euclidean consensus ADMM to the Wasserstein consensus ADMM, and to its entropy-regularized version Sinkhorn consensus ADMM. The inner-layer ADMM turns out to be a specific instance of the standard Euclidean ADMM. The overall algorithm realizes operator splitting for gradient flows in the manifold of probability measures.  ( 2 min )
    Tackling benign nonconvexity with smoothing and stochastic gradients. (arXiv:2202.09052v1 [cs.LG])
    Non-convex optimization problems are ubiquitous in machine learning, especially in Deep Learning. While such complex problems can often be successfully optimized in practice by using stochastic gradient descent (SGD), theoretical analysis cannot adequately explain this success. In particular, the standard analyses do not show global convergence of SGD on non-convex functions, and instead show convergence to stationary points (which can also be local minima or saddle points). We identify a broad class of nonconvex functions for which we can show that perturbed SGD (gradient descent perturbed by stochastic noise -- covering SGD as a special case) converges to a global minimum (or a neighborhood thereof), in contrast to gradient descent without noise that can get stuck in local minima far from a global solution. For example, on non-convex functions that are relatively close to a convex-like (strongly convex or PL) function we show that SGD can converge linearly to a global optimum.  ( 2 min )
    Linearization and Identification of Multiple-Attractors Dynamical System through Laplacian Eigenmaps. (arXiv:2202.09171v1 [cs.LG])
    Dynamical Systems (DS) are fundamental to the modeling and understanding of time evolving phenomena, and find application in physics, biology and control. As determining an analytical description of the dynamics is often difficult, data-driven approaches are preferred for identifying and controlling nonlinear DS with multiple equilibrium points. Identification of such DS has been treated largely as a supervised learning problem. Instead, we focus on a unsupervised learning scenario where we know neither the number nor the type of dynamics. We propose a Graph-based spectral clustering method that takes advantage of a velocity-augmented kernel to connect data-points belonging to the same dynamics, while preserving the natural temporal evolution. We study the eigenvectors and eigenvalues of the Graph Laplacian and show that they form a set of orthogonal embedding spaces, one for each sub-dynamics. We prove that there always exist a set of 2-dimensional embedding spaces in which the sub-dynamics are linear, and n-dimensional embedding where they are quasi-linear. We compare the clustering performance of our algorithm to Kernel K-Means, Spectral Clustering and Gaussian Mixtures and show that, even when these algorithms are provided with the true number of sub-dynamics, they fail to cluster them correctly. We learn a diffeomorphism from the Laplacian embedding space to the original space and show that the Laplacian embedding leads to good reconstruction accuracy and a faster training time through an exponential decaying loss, compared to the state of the art diffeomorphism-based approaches.  ( 2 min )
    Space4HGNN: A Novel, Modularized and Reproducible Platform to Evaluate Heterogeneous Graph Neural Network. (arXiv:2202.09177v1 [cs.LG])
    Heterogeneous Graph Neural Network (HGNN) has been successfully employed in various tasks, but we cannot accurately know the importance of different design dimensions of HGNNs due to diverse architectures and applied scenarios. Besides, in the research community of HGNNs, implementing and evaluating various tasks still need much human effort. To mitigate these issues, we first propose a unified framework covering most HGNNs, consisting of three components: heterogeneous linear transformation, heterogeneous graph transformation, and heterogeneous message passing layer. Then we build a platform Space4HGNN by defining a design space for HGNNs based on the unified framework, which offers modularized components, reproducible implementations, and standardized evaluation for HGNNs. Finally, we conduct experiments to analyze the effect of different designs. With the insights found, we distill a condensed design space and verify its effectiveness.  ( 2 min )
    FLAME: Federated Learning Across Multi-device Environments. (arXiv:2202.08922v1 [cs.LG])
    Federated Learning (FL) enables distributed training of machine learning models while keeping personal data on user devices private. While we witness increasing applications of FL in the area of mobile sensing, such as human-activity recognition, FL has not been studied in the context of a multi-device environment (MDE), wherein each user owns multiple data-producing devices. With the proliferation of mobile and wearable devices, MDEs are increasingly becoming popular in ubicomp settings, therefore necessitating the study of FL in them. FL in MDEs is characterized by high non-IID-ness across clients, complicated by the presence of both user and device heterogeneities. Further, ensuring efficient utilization of system resources on FL clients in a MDE remains an important challenge. In this paper, we propose FLAME, a user-centered FL training approach to counter statistical and system heterogeneity in MDEs, and bring consistency in inference performance across devices. FLAME features (i) user-centered FL training utilizing the time alignment across devices from the same user; (ii) accuracy- and efficiency-aware device selection; and (iii) model personalization to devices. We also present an FL evaluation testbed with realistic energy drain and network bandwidth profiles, and a novel class-based data partitioning scheme to extend existing HAR datasets to a federated setup. Our experiment results on three multi-device HAR datasets show that FLAME outperforms various baselines by 4.8-33.8% higher F-1 score, 1.02-2.86x greater energy efficiency, and up to 2.02x speedup in convergence to target accuracy through fair distribution of the FL workload.  ( 2 min )
    KINet: Keypoint Interaction Networks for Unsupervised Forward Modeling. (arXiv:2202.09006v1 [cs.CV])
    Object-centric representation is an essential abstraction for physical reasoning and forward prediction. Most existing approaches learn this representation through extensive supervision (e.g., object class and bounding box) although such ground-truth information is not readily accessible in reality. To address this, we introduce KINet (Keypoint Interaction Network) -- an end-to-end unsupervised framework to reason about object interactions in complex systems based on a keypoint representation. Using visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model using contrastive estimation to predict future keypoint states. By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects, and novel object geometries. Experiments demonstrate the effectiveness of our model to accurately perform forward prediction and learn plannable object-centric representations which can also be used in downstream model-based control tasks.  ( 2 min )
    Stock Embeddings: Learning Distributed Representations for Financial Assets. (arXiv:2202.08968v1 [q-fin.ST])
    Identifying meaningful relationships between the price movements of financial assets is a challenging but important problem in a variety of financial applications. However with recent research, particularly those using machine learning and deep learning techniques, focused mostly on price forecasting, the literature investigating the modelling of asset correlations has lagged somewhat. To address this, inspired by recent successes in natural language processing, we propose a neural model for training stock embeddings, which harnesses the dynamics of historical returns data in order to learn the nuanced relationships that exist between financial assets. We describe our approach in detail and discuss a number of ways that it can be used in the financial domain. Furthermore, we present the evaluation results to demonstrate the utility of this approach, compared to several important benchmarks, in two real-world financial analytics tasks.  ( 2 min )
    Effective Urban Region Representation Learning Using Heterogeneous Urban Graph Attention Network (HUGAT). (arXiv:2202.09021v1 [cs.LG])
    Revealing the hidden patterns shaping the urban environment is essential to understand its dynamics and to make cities smarter. Recent studies have demonstrated that learning the representations of urban regions can be an effective strategy to uncover the intrinsic characteristics of urban areas. However, existing studies lack in incorporating diversity in urban data sources. In this work, we propose heterogeneous urban graph attention network (HUGAT), which incorporates heterogeneity of diverse urban datasets. In HUGAT, heterogeneous urban graph (HUG) incorporates both the geo-spatial and temporal people movement variations in a single graph structure. Given a HUG, a set of meta-paths are designed to capture the rich urban semantics as composite relations between nodes. Region embedding is carried out using heterogeneous graph attention network (HAN). HUGAT is designed to consider multiple learning objectives of city's geo-spatial and mobility variations simultaneously. In our extensive experiments on NYC data, HUGAT outperformed all the state-of-the-art models. Moreover, it demonstrated a robust generalization capability across the various prediction tasks of crime, average personal income, and bike flow as well as the spatial clustering task.  ( 2 min )
    Microplankton life histories revealed by holographic microscopy and deep learning. (arXiv:2202.09046v1 [physics.bio-ph])
    The marine microbial food web plays a central role in the global carbon cycle. Our mechanistic understanding of the ocean, however, is biased towards its larger constituents, while rates and biomass fluxes in the microbial food web are mainly inferred from indirect measurements and ensemble averages. Yet, resolution at the level of the individual microplankton is required to advance our understanding of the oceanic food web. Here, we demonstrate that, by combining holographic microscopy with deep learning, we can follow microplanktons throughout their lifespan, continuously measuring their three dimensional position and dry mass. The deep learning algorithms circumvent the computationally intensive processing of holographic data and allow rapid measurements over extended time periods. This permits us to reliably estimate growth rates, both in terms of dry mass increase and cell divisions, as well as to measure trophic interactions between species such as predation events. The individual resolution provides information about selectivity, individual feeding rates and handling times for individual microplanktons. This method is particularly useful to explore the flux of carbon through micro-zooplankton, the most important and least known group of primary consumers in the global oceans. We exemplify this by detailed descriptions of micro-zooplankton feeding events, cell divisions, and long term monitoring of single cells from division to division.  ( 2 min )
    Energy-Efficient Parking Analytics System using Deep Reinforcement Learning. (arXiv:2202.08973v1 [cs.CV])
    Advances in deep vision techniques and ubiquity of smart cameras will drive the next generation of video analytics. However, video analytics applications consume vast amounts of energy as both deep learning techniques and cameras are power-hungry. In this paper, we focus on a parking video analytics platform and propose RL-CamSleep, a deep reinforcement learning-based technique, to actuate the cameras to reduce the energy footprint while retaining the system's utility. Our key insight is that many video-analytics applications do not always need to be operational, and we can design policies to activate video analytics only when necessary. Moreover, our work is complementary to existing work that focuses on improving hardware and software efficiency. We evaluate our approach on a city-scale parking dataset having 76 streets spread across the city. Our analysis demonstrates how streets have various parking patterns, highlighting the importance of an adaptive policy. Our approach can learn such an adaptive policy that can reduce the average energy consumption by 76.38% and achieve an average accuracy of more than 98% in performing video analytics.  ( 2 min )
    High-performance automatic categorization and attribution of inventory catalogs. (arXiv:2202.08965v1 [cs.IR])
    Techniques of machine learning for automatic text categorization are applied and adapted for the problem of inventory catalog data attribution, with different approaches explored and optimal solution addressing the tradeoff between accuracy and performance is selected.  ( 2 min )
    Critical Checkpoints for Evaluating Defence Models Against Adversarial Attack and Robustness. (arXiv:2202.09039v1 [cs.CR])
    From past couple of years there is a cycle of researchers proposing a defence model for adversaries in machine learning which is arguably defensible to most of the existing attacks in restricted condition (they evaluate on some bounded inputs or datasets). And then shortly another set of researcher finding the vulnerabilities in that defence model and breaking it by proposing a stronger attack model. Some common flaws are been noticed in the past defence models that were broken in very short time. Defence models being broken so easily is a point of concern as decision of many crucial activities are taken with the help of machine learning models. So there is an utter need of some defence checkpoints that any researcher should keep in mind while evaluating the soundness of technique and declaring it to be decent defence technique. In this paper, we have suggested few checkpoints that should be taken into consideration while building and evaluating the soundness of defence models. All these points are recommended after observing why some past defence models failed and how some model remained adamant and proved their soundness against some of the very strong attacks.  ( 2 min )
    A Review on Methods and Applications in Multimodal Deep Learning. (arXiv:2202.09195v1 [cs.LG])
    Deep Learning has implemented a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. Detailed analysis of the baseline approaches and an in-depth study of recent advancements during the last five years (2017 to 2021) in multimodal deep learning applications has been provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Lastly, main issues are highlighted separately for each domain, along with their possible future research directions.  ( 2 min )
    DARL1N: Distributed multi-Agent Reinforcement Learning with One-hop Neighbors. (arXiv:2202.09019v1 [cs.MA])
    Most existing multi-agent reinforcement learning (MARL) methods are limited in the scale of problems they can handle. Particularly, with the increase of the number of agents, their training costs grow exponentially. In this paper, we address this limitation by introducing a scalable MARL method called Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N). DARL1N is an off-policy actor-critic method that breaks the curse of dimensionality by decoupling the global interactions among agents and restricting information exchanges to one-hop neighbors. Each agent optimizes its action value and policy functions over a one-hop neighborhood, significantly reducing the learning complexity, yet maintaining expressiveness by training with varying numbers and states of neighbors. This structure allows us to formulate a distributed learning framework to further speed up the training procedure. Comparisons with state-of-the-art MARL methods show that DARL1N significantly reduces training time without sacrificing policy quality and is scalable as the number of agents increases.  ( 2 min )
    PerFED-GAN: Personalized Federated Learning via Generative Adversarial Networks. (arXiv:2202.09155v1 [cs.LG])
    Federated learning is gaining popularity as a distributed machine learning method that can be used to deploy AI-dependent IoT applications while protecting client data privacy and security. Due to the differences of clients, a single global model may not perform well on all clients, so the personalized federated learning method, which trains a personalized model for each client that better suits its individual needs, becomes a research hotspot. Most personalized federated learning research, however, focuses on data heterogeneity while ignoring the need for model architecture heterogeneity. Most existing federated learning methods uniformly set the model architecture of all clients participating in federated learning, which is inconvenient for each client's individual model and local data distribution requirements, and also increases the risk of client model leakage. This paper proposes a federated learning method based on co-training and generative adversarial networks(GANs) that allows each client to design its own model to participate in federated learning training independently without sharing any model architecture or parameter information with other clients or a center. In our experiments, the proposed method outperforms the existing methods in mean test accuracy by 42% when the client's model architecture and data distribution vary significantly.  ( 2 min )
    A Free Lunch with Influence Functions? Improving Neural Network Estimates with Concepts from Semiparametric Statistics. (arXiv:2202.09096v1 [cs.LG])
    Parameter estimation in the empirical fields is usually undertaken using parametric models, and such models are convenient because they readily facilitate statistical inference. Unfortunately, they are unlikely to have a sufficiently flexible functional form to be able to adequately model real-world phenomena, and their usage may therefore result in biased estimates and invalid inference. Unfortunately, whilst non-parametric machine learning models may provide the needed flexibility to adapt to the complexity of real-world phenomena, they do not readily facilitate statistical inference, and may still exhibit residual bias. We explore the potential for semiparametric theory (in particular, the Influence Function) to be used to improve neural networks and machine learning algorithms in terms of (a) improving initial estimates without needing more data (b) increasing the robustness of our models, and (c) yielding confidence intervals for statistical inference. We propose a new neural network method MultiNet, which seeks the flexibility and diversity of an ensemble using a single architecture. Results on causal inference tasks indicate that MultiNet yields better performance than other approaches, and that all considered methods are amenable to improvement from semiparametric techniques under certain conditions. In other words, with these techniques we show that we can improve existing neural networks for `free', without needing more data, and without needing to retrain them. Finally, we provide the expression for deriving influence functions for estimands from a general graph, and the code to do so automatically.  ( 2 min )
    Cyclical Focal Loss. (arXiv:2202.08978v1 [cs.CV])
    The cross-entropy softmax loss is the primary loss function used to train deep neural networks. On the other hand, the focal loss function has been demonstrated to provide improved performance when there is an imbalance in the number of training samples in each class, such as in long-tailed datasets. In this paper, we introduce a novel cyclical focal loss and demonstrate that it is a more universal loss function than cross-entropy softmax loss or focal loss. We describe the intuition behind the cyclical focal loss and our experiments provide evidence that cyclical focal loss provides superior performance for balanced, imbalanced, or long-tailed datasets. We provide numerous experimental results for CIFAR-10/CIFAR-100, ImageNet, balanced and imbalanced 4,000 training sample versions of CIFAR-10/CIFAR-100, and ImageNet-LT and Places-LT from the Open Long-Tailed Recognition (OLTR) challenge. Implementing the cyclical focal loss function requires only a few lines of code and does not increase training time. In the spirit of reproducibility, our code is available at \url{https://github.com/lnsmith54/CFL}.  ( 2 min )
    Out of Distribution Data Detection Using Dropout Bayesian Neural Networks. (arXiv:2202.08985v1 [cs.LG])
    We explore the utility of information contained within a dropout based Bayesian neural network (BNN) for the task of detecting out of distribution (OOD) data. We first show how previous attempts to leverage the randomized embeddings induced by the intermediate layers of a dropout BNN can fail due to the distance metric used. We introduce an alternative approach to measuring embedding uncertainty, justify its use theoretically, and demonstrate how incorporating embedding uncertainty improves OOD data identification across three tasks: image classification, language classification, and malware detection.  ( 2 min )
    Ensemble and Multimodal Approach for Forecasting Cryptocurrency Price. (arXiv:2202.08967v1 [q-fin.ST])
    Since the birth of Bitcoin in 2009, cryptocurrencies have emerged to become a global phenomenon and an important decentralized financial asset. Due to this decentralization, the value of these digital currencies against fiat currencies is highly volatile over time. Therefore, forecasting the crypto-fiat currency exchange rate is an extremely challenging task. For reliable forecasting, this paper proposes a multimodal AdaBoost-LSTM ensemble approach that employs all modalities which derive price fluctuation such as social media sentiments, search volumes, blockchain information, and trading data. To better support investment decision making, the approach forecasts also the fluctuation distribution. The conducted extensive experiments demonstrated the effectiveness of relying on multimodalities instead of only trading data. Further experiments demonstrate the outperformance of the proposed approach compared to existing tools and methods with a 19.29% improvement.  ( 2 min )
    Combining Varied Learners for Binary Classification using Stacked Generalization. (arXiv:2202.08910v1 [cs.LG])
    The Machine Learning has various learning algorithms that are better in some or the other aspect when compared with each other but a common error that all algorithms will suffer from is training data with very high dimensional feature set. This usually ends up algorithms into generalization error that deplete the performance. This can be solved using an Ensemble Learning method known as Stacking commonly termed as Stacked Generalization. In this paper we perform binary classification using Stacked Generalization on high dimensional Polycystic Ovary Syndrome dataset and prove the point that model becomes generalized and metrics improve significantly. The various metrics are given in this paper that also point out a subtle transgression found with Receiver Operating Characteristic Curve that was proved to be incorrect.  ( 2 min )
    RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing. (arXiv:2202.08862v1 [cs.SD])
    We present RemixIT, a simple yet effective selfsupervised method for training speech enhancement without the need of a single isolated in-domain speech nor a noise waveform. Our approach overcomes limitations of previous methods which make them dependent to clean in-domain target signals and thus, sensitive to any domain mismatch between train and test samples. RemixIT is based on a continuous self-training scheme in which a pre-trained teacher model on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures. Then, by permuting the estimated clean and noise signals and remixing them together, we generate a new set of bootstrapped mixtures and corresponding pseudo-targets which are used to train the student network. Vice-versa, the teacher periodically refines its estimates using the updated parameters of the latest student models. Experimental results on multiple speech enhancement datasets and tasks not only show the superiority of our method over prior approaches but also showcase that RemixIT can be combined with any separation model as well as be applied towards any semi-supervised and unsupervised domain adaptation task. Our analysis, paired with empirical evidence, sheds light on the inside functioning of our self-training scheme wherein the student model keeps obtaining better performance while observing severely degraded pseudo-targets.  ( 2 min )
    Generalizing Aggregation Functions in GNNs:High-Capacity GNNs via Nonlinear Neighborhood Aggregators. (arXiv:2202.09145v1 [cs.LG])
    Graph neural networks (GNNs) have achieved great success in many graph learning tasks. The main aspect powering existing GNNs is the multi-layer network architecture to learn the nonlinear graph representations for the specific learning tasks. The core operation in GNNs is message propagation in which each node updates its representation by aggregating its neighbors' representations. Existing GNNs mainly adopt either linear neighborhood aggregation (mean,sum) or max aggregator in their message propagation. (1) For linear aggregators, the whole nonlinearity and network's capacity of GNNs are generally limited due to deeper GNNs usually suffer from over-smoothing issue. (2) For max aggregator, it usually fails to be aware of the detailed information of node representations within neighborhood. To overcome these issues, we re-think the message propagation mechanism in GNNs and aim to develop the general nonlinear aggregators for neighborhood information aggregation in GNNs. One main aspect of our proposed nonlinear aggregators is that they provide the optimally balanced aggregators between max and mean/sum aggregations. Thus, our aggregators can inherit both (i) high nonlinearity that increases network's capacity and (ii) detail-sensitivity that preserves the detailed information of representations together in GNNs' message propagation. Promising experiments on several datasets show the effectiveness of the proposed nonlinear aggregators.  ( 2 min )
    Fast online inference for nonlinear contextual bandit based on Generative Adversarial Network. (arXiv:2202.08867v1 [cs.LG])
    This work addresses the efficiency concern on inferring a nonlinear contextual bandit when the number of arms $n$ is very large. We propose a neural bandit model with an end-to-end training process to efficiently perform bandit algorithms such as Thompson Sampling and UCB during inference. We advance state-of-the-art time complexity to $O(\log n)$ with approximate Bayesian inference, neural random feature mapping, approximate global maxima and approximate nearest neighbor search. We further propose a generative adversarial network to shift the bottleneck of maximizing the objective for selecting optimal arms from inference time to training time, enjoying significant speedup with additional advantage of enabling batch and parallel processing. %The generative model can inference an approximate argmax of the posterior sampling in logarithmic time complexity with the help of approximate nearest neighbor search. Extensive experiments on classification and recommendation tasks demonstrate order-of-magnitude improvement in inference time no significant degradation on the performance.  ( 2 min )
    GNN-Surrogate: A Hierarchical and Adaptive Graph Neural Network for Parameter Space Exploration of Unstructured-Mesh Ocean Simulations. (arXiv:2202.08956v1 [physics.ao-ph])
    We propose GNN-Surrogate, a graph neural network-based surrogate model to explore the parameter space of ocean climate simulations. Parameter space exploration is important for domain scientists to understand the influence of input parameters (e.g., wind stress) on the simulation output (e.g., temperature). The exploration requires scientists to exhaust the complicated parameter space by running a batch of computationally expensive simulations. Our approach improves the efficiency of parameter space exploration with a surrogate model that predicts the simulation outputs accurately and efficiently. Specifically, GNN-Surrogate predicts the output field with given simulation parameters so scientists can explore the simulation parameter space with visualizations from user-specified visual mappings. Moreover, our graph-based techniques are designed for unstructured meshes, making the exploration of simulation outputs on irregular grids efficient. For efficient training, we generate hierarchical graphs and use adaptive resolutions. We give quantitative and qualitative evaluations on the MPAS-Ocean simulation to demonstrate the effectiveness and efficiency of GNN-Surrogate. Source code is publicly available at https://github.com/trainsn/GNN-Surrogate.  ( 2 min )
    Improving English to Sinhala Neural Machine Translation using Part-of-Speech Tag. (arXiv:2202.08882v1 [cs.CL])
    The performance of Neural Machine Translation (NMT) depends significantly on the size of the available parallel corpus. Due to this fact, low resource language pairs demonstrate low translation performance compared to high resource language pairs. The translation quality further degrades when NMT is performed for morphologically rich languages. Even though the web contains a large amount of information, most people in Sri Lanka are unable to read and understand English properly. Therefore, there is a huge requirement of translating English content to local languages to share information among locals. Sinhala language is the primary language in Sri Lanka and building an NMT system that can produce quality English to Sinhala translations is difficult due to the syntactic divergence between these two languages under low resource constraints. Thus, in this research, we explore effective methods of incorporating Part of Speech (POS) tags to the Transformer input embedding and positional encoding to further enhance the performance of the baseline English to Sinhala neural machine translation model.  ( 2 min )
    Interpolation and Regularization for Causal Learning. (arXiv:2202.09054v1 [stat.ML])
    We study the problem of learning causal models from observational data through the lens of interpolation and its counterpart -- regularization. A large volume of recent theoretical, as well as empirical work, suggests that, in highly complex model classes, interpolating estimators can have good statistical generalization properties and can even be optimal for statistical learning. Motivated by an analogy between statistical and causal learning recently highlighted by Janzing (2019), we investigate whether interpolating estimators can also learn good causal models. To this end, we consider a simple linearly confounded model and derive precise asymptotics for the *causal risk* of the min-norm interpolator and ridge-regularized regressors in the high-dimensional regime. Under the principle of independent causal mechanisms, a standard assumption in causal learning, we find that interpolators cannot be optimal and causal learning requires stronger regularization than statistical learning. This resolves a recent conjecture in Janzing (2019). Beyond this assumption, we find a larger range of behavior that can be precisely characterized with a new measure of *confounding strength*. If the confounding strength is negative, causal learning requires weaker regularization than statistical learning, interpolators can be optimal, and the optimal regularization can even be negative. If the confounding strength is large, the optimal regularization is infinite, and learning from observational data is actively harmful.  ( 2 min )
  • Open

    A global convergence theory for deep ReLU implicit networks via over-parameterization. (arXiv:2110.05645v2 [cs.LG] UPDATED)
    Implicit deep learning has received increasing attention recently due to the fact that it generalizes the recursive prediction rules of many commonly used neural network architectures. Its prediction rule is provided implicitly based on the solution of an equilibrium equation. Although a line of recent empirical studies has demonstrated its superior performances, the theoretical understanding of implicit neural networks is limited. In general, the equilibrium equation may not be well-posed during the training. As a result, there is no guarantee that a vanilla (stochastic) gradient descent (SGD) training nonlinear implicit neural networks can converge. This paper fills the gap by analyzing the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks. For an $m$-width implicit neural network with ReLU activation and $n$ training samples, we show that a randomly initialized gradient descent converges to a global minimum at a linear rate for the square loss function if the implicit neural network is \textit{over-parameterized}. It is worth noting that, unlike existing works on the convergence of (S)GD on finite-layer over-parameterized neural networks, our convergence results hold for implicit neural networks, where the number of layers is \textit{infinite}.  ( 2 min )
    Fast Rank-1 NMF for Missing Data with KL Divergence. (arXiv:2110.12595v2 [stat.ML] UPDATED)
    We propose a fast non-gradient-based method of rank-1 non-negative matrix factorization (NMF) for missing data, called A1GM, that minimizes the KL divergence from an input matrix to the reconstructed rank-1 matrix. Our method is based on our new finding of an analytical closed-formula of the best rank-1 non-negative multiple matrix factorization (NMMF), a variety of NMF. NMMF is known to exactly solve NMF for missing data if positions of missing values satisfy a certain condition, and A1GM transforms a given matrix so that the analytical solution to NMMF can be applied. We empirically show that A1GM is more efficient than a gradient method with competitive reconstruction errors.  ( 2 min )
    Conditional independence by typing. (arXiv:2010.11887v2 [cs.PL] UPDATED)
    A central goal of probabilistic programming languages (PPLs) is to separate modelling from inference. However, this goal is hard to achieve in practice. Users are often forced to re-write their models in order to improve efficiency of inference or meet restrictions imposed by the PPL. Conditional independence (CI) relationships among parameters are a crucial aspect of probabilistic models that capture a qualitative summary of the specified model and can facilitate more efficient inference. We present an information flow type system for probabilistic programming that captures conditional independence (CI) relationships, and show that, for a well-typed program in our system, the distribution it implements is guaranteed to have certain CI-relationships. Further, by using type inference, we can statically deduce which CI-properties are present in a specified model. As a practical application, we consider the problem of how to perform inference on models with mixed discrete and continuous parameters. Inference on such models is challenging in many existing PPLs, but can be improved through a workaround, where the discrete parameters are used implicitly, at the expense of manual model re-writing. We present a source-to-source semantics-preserving transformation, which uses our CI-type system to automate this workaround by eliminating the discrete parameters from a probabilistic program. The resulting program can be seen as a hybrid inference algorithm on the original program, where continuous parameters can be drawn using efficient gradient-based inference methods, while the discrete parameters are inferred using variable elimination. We implement our CI-type system and its example application in SlicStan: a compositional variant of Stan.  ( 2 min )
    A Distance Covariance-based Kernel for Nonlinear Causal Clustering in Heterogeneous Populations. (arXiv:2106.03480v2 [stat.ML] UPDATED)
    We consider the problem of causal structure learning in the setting of heterogeneous populations, i.e., populations in which a single causal structure does not adequately represent all population members, as is common in biological and social sciences. To this end, we introduce a distance covariance-based kernel designed specifically to measure the similarity between the underlying nonlinear causal structures of different samples. Indeed, we prove that the corresponding feature map is a statistically consistent estimator of nonlinear independence structure, rendering the kernel itself a statistical test for the hypothesis that sets of samples come from different generating causal structures. Even stronger, we prove that the kernel space is isometric to the space of causal ancestral graphs, so that distance between samples in the kernel space is guaranteed to correspond to distance between their generating causal structures. This kernel thus enables us to perform clustering to identify the homogeneous subpopulations, for which we can then learn causal structures using existing methods. Though we focus on the theoretical aspects of the kernel, we also evaluate its performance on synthetic data and demonstrate its use on a real gene expression data set.  ( 2 min )
    Multi-Level Local SGD for Heterogeneous Hierarchical Networks. (arXiv:2007.13819v3 [cs.LG] UPDATED)
    We propose Multi-Level Local SGD, a distributed gradient method for learning a smooth, non-convex objective in a heterogeneous multi-level network. Our network model consists of a set of disjoint sub-networks, with a single hub and multiple worker nodes; further, worker nodes may have different operating rates. The hubs exchange information with one another via a connected, but not necessarily complete communication network. In our algorithm, sub-networks execute a distributed SGD algorithm, using a hub-and-spoke paradigm, and the hubs periodically average their models with neighboring hubs. We first provide a unified mathematical framework that describes the Multi-Level Local SGD algorithm. We then present a theoretical analysis of the algorithm; our analysis shows the dependence of the convergence error on the worker node heterogeneity, hub network topology, and the number of local, sub-network, and global iterations. We back up our theoretical results via simulation-based experiments using both convex and non-convex objectives.  ( 2 min )
    Variational Diffusion Models. (arXiv:2107.00630v3 [cs.LG] UPDATED)
    Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum.  ( 2 min )
    Strategyproof Learning: Building Trustworthy User-Generated Datasets. (arXiv:2106.02398v2 [cs.LG] UPDATED)
    We prove in this paper that, perhaps surprisingly, incentivizing data misreporting is not a fatality. By leveraging a careful design of the loss function, we propose Licchavi, a global and personalized learning framework with provable strategyproofness guarantees. Essentially, we prove that no user can gain much by replying to Licchavi's queries with answers that deviate from their true preferences. Interestingly, Licchavi also promotes the desirable "one person, one unit-force vote" fairness principle. Furthermore, our empirical evaluation of its performance showcases Licchavi's real-world applicability. We believe that our results are critical for the safety of any learning scheme that leverages user-generated data.  ( 2 min )
    When OT meets MoM: Robust estimation of Wasserstein Distance. (arXiv:2006.10325v3 [stat.ML] UPDATED)
    Issued from Optimal Transport, the Wasserstein distance has gained importance in Machine Learning due to its appealing geometrical properties and the increasing availability of efficient approximations. In this work, we consider the problem of estimating the Wasserstein distance between two probability distributions when observations are polluted by outliers. To that end, we investigate how to leverage Medians of Means (MoM) estimators to robustify the estimation of Wasserstein distance. Exploiting the dual Kantorovitch formulation of Wasserstein distance, we introduce and discuss novel MoM-based robust estimators whose consistency is studied under a data contamination model and for which convergence rates are provided. These MoM estimators enable to make Wasserstein Generative Adversarial Network (WGAN) robust to outliers, as witnessed by an empirical study on two benchmarks CIFAR10 and Fashion MNIST. Eventually, we discuss how to combine MoM with the entropy-regularized approximation of the Wasserstein distance and propose a simple MoM-based re-weighting scheme that could be used in conjunction with the Sinkhorn algorithm.  ( 2 min )
    Federated Asymptotics: a model to compare federated learning algorithms. (arXiv:2108.07313v3 [cs.LG] UPDATED)
    We propose an asymptotic framework to analyze the performance of (personalized) federated learning algorithms. In this new framework, we formulate federated learning as a multi-criterion objective, where the goal is to minimize each client's loss using information from all of the clients. We analyze a linear regression model where, for a given client, we may theoretically compare the performance of various algorithms in the high-dimensional asymptotic limit. This asymptotic multi-criterion approach naturally models the high-dimensional, many-device nature of federated learning. These tools make fairly precise predictions about the benefits of personalization and information sharing in federated scenarios -- at least in our (stylized) model -- including that Federated Averaging with simple client fine-tuning achieves the same asymptotic risk as the more intricate meta-learning and proximal-regularized approaches and outperforming Federated Averaging without personalization. We evaluate these predictions on federated versions of the EMNIST, CIFAR-100, Shakespeare, and Stack Overflow datasets, where the experiments corroborate the theoretical predictions, suggesting such frameworks may provide a useful guide to practical algorithmic development.  ( 2 min )
    A Free Lunch with Influence Functions? Improving Neural Network Estimates with Concepts from Semiparametric Statistics. (arXiv:2202.09096v1 [cs.LG])
    Parameter estimation in the empirical fields is usually undertaken using parametric models, and such models are convenient because they readily facilitate statistical inference. Unfortunately, they are unlikely to have a sufficiently flexible functional form to be able to adequately model real-world phenomena, and their usage may therefore result in biased estimates and invalid inference. Unfortunately, whilst non-parametric machine learning models may provide the needed flexibility to adapt to the complexity of real-world phenomena, they do not readily facilitate statistical inference, and may still exhibit residual bias. We explore the potential for semiparametric theory (in particular, the Influence Function) to be used to improve neural networks and machine learning algorithms in terms of (a) improving initial estimates without needing more data (b) increasing the robustness of our models, and (c) yielding confidence intervals for statistical inference. We propose a new neural network method MultiNet, which seeks the flexibility and diversity of an ensemble using a single architecture. Results on causal inference tasks indicate that MultiNet yields better performance than other approaches, and that all considered methods are amenable to improvement from semiparametric techniques under certain conditions. In other words, with these techniques we show that we can improve existing neural networks for `free', without needing more data, and without needing to retrain them. Finally, we provide the expression for deriving influence functions for estimands from a general graph, and the code to do so automatically.  ( 2 min )
    Federated Multi-Task Learning under a Mixture of Distributions. (arXiv:2108.10252v3 [cs.LG] UPDATED)
    The increasing size of data generated by smartphones and IoT devices motivated the development of Federated Learning (FL), a framework for on-device collaborative training of machine learning models. First efforts in FL focused on learning a single global model with good average performance across clients, but the global model may be arbitrarily bad for a given client, due to the inherent heterogeneity of local data distributions. Federated multi-task learning (MTL) approaches can learn personalized models by formulating an opportune penalized optimization problem. The penalization term can capture complex relations among personalized models, but eschews clear statistical assumptions about local data distributions. In this work, we propose to study federated MTL under the flexible assumption that each local data distribution is a mixture of unknown underlying distributions. This assumption encompasses most of the existing personalized FL approaches and leads to federated EM-like algorithms for both client-server and fully decentralized settings. Moreover, it provides a principled way to serve personalized models to clients not seen at training time. The algorithms' convergence is analyzed through a novel federated surrogate optimization framework, which can be of general interest. Experimental results on FL benchmarks show that our approach provides models with higher accuracy and fairness than state-of-the-art methods.  ( 2 min )
    Churn modeling of life insurance policies via statistical and machine learning methods -- Analysis of important features. (arXiv:2202.09182v1 [stat.ML])
    Life assurance companies typically possess a wealth of data covering multiple systems and databases. These data are often used for analyzing the past and for describing the present. Taking account of the past, the future is mostly forecasted by traditional statistical methods. So far, only a few attempts were undertaken to perform estimations by means of machine learning approaches. In this work, the individual contract cancellation behavior of customers within two partial stocks is modeled by the aid of various classification methods. Partial stocks of private pension and endowment policy are considered. We describe the data used for the modeling, their structured and in which way they are cleansed. The utilized models are calibrated on the basis of an extensive tuning process, then graphically evaluated regarding their goodness-of-fit and with the help of a variable relevance concept, we investigate which features notably affect the individual contract cancellation behavior.  ( 2 min )
    Quantifying the Effects of Data Augmentation. (arXiv:2202.09134v1 [cs.LG])
    We provide results that exactly quantify how data augmentation affects the convergence rate and variance of estimates. They lead to some unexpected findings: Contrary to common intuition, data augmentation may increase rather than decrease uncertainty of estimates, such as the empirical prediction risk. Our main theoretical tool is a limit theorem for functions of randomly transformed, high-dimensional random vectors. The proof draws on work in probability on noise stability of functions of many variables. The pathological behavior we identify is not a consequence of complex models, but can occur even in the simplest settings -- one of our examples is a linear ridge regressor with two parameters. On the other hand, our results also show that data augmentation can have real, quantifiable benefits.  ( 2 min )
    Masked prediction tasks: a parameter identifiability view. (arXiv:2202.09305v1 [cs.LG])
    The vast majority of work in self-supervised learning, both theoretical and empirical (though mostly the latter), have largely focused on recovering good features for downstream tasks, with the definition of "good" often being intricately tied to the downstream task itself. This lens is undoubtedly very interesting, but suffers from the problem that there isn't a "canonical" set of downstream tasks to focus on -- in practice, this problem is usually resolved by competing on the benchmark dataset du jour. In this paper, we present an alternative lens: one of parameter identifiability. More precisely, we consider data coming from a parametric probabilistic model, and train a self-supervised learning predictor with a suitably chosen parametric form. Then, we ask whether we can read off the ground truth parameters of the probabilistic model from the optimal predictor. We focus on the widely used self-supervised learning method of predicting masked tokens, which is popular for both natural languages and visual data. While incarnations of this approach have already been successfully used for simpler probabilistic models (e.g. learning fully-observed undirected graphical models), we focus instead on latent-variable models capturing sequential structures -- namely Hidden Markov Models with both discrete and conditionally Gaussian observations. We show that there is a rich landscape of possibilities, out of which some prediction tasks yield identifiability, while others do not. Our results, borne of a theoretical grounding of self-supervised learning, could thus potentially beneficially inform practice. Moreover, we uncover close connections with uniqueness of tensor rank decompositions -- a widely used tool in studying identifiability through the lens of the method of moments.  ( 2 min )
    Universal Prediction Band via Semi-Definite Programming. (arXiv:2103.17203v2 [stat.ML] UPDATED)
    We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.  ( 2 min )
    Hyperparameter Selection Methods for Fitted Q-Evaluation with Error Guarantee. (arXiv:2201.02300v2 [stat.ML] UPDATED)
    We are concerned with the problem of hyperparameter selection for the fitted Q-evaluation (FQE). FQE is one of the state-of-the-art method for offline policy evaluation (OPE), which is essential to the reinforcement learning without environment simulators. However, like other OPE methods, FQE is not hyperparameter-free itself and that undermines the utility in real-life applications. We address this issue by proposing a framework of approximate hyperparameter selection (AHS) for FQE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods each of which has different characteristics such as distribution-mismatch tolerance and time complexity. We also confirm in experiments that the error bound given by the theory matches empirical observations.  ( 2 min )
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v5 [stat.ML] UPDATED)
    We consider the existence of fixed points of nonnegative neural networks, i.e., neural networks that take as an input nonnegative vectors and process them using nonnegative parameters. We first show that nonnegative neural networks can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks is often an interval, which degenerates to a point for the case of scalable networks. The results of this paper contribute to the understanding of the behavior of autoencoders, because the fixed point set of an autoencoder is precisely the set of points that can be perfectly reconstructed. Moreover, they provide insight into neural networks designed using the loop-unrolling technique, which can be seen as a fixed point searching algorithm. The chief theoretical results of this paper are verified in numerical simulations, where we consider an autoencoder that first compresses angular power spectra in massive MIMO systems, and, second, reconstruct the input spectra from the compressed signals.  ( 2 min )
    Nonstationary multi-output Gaussian processes via harmonizable spectral mixtures. (arXiv:2202.09233v1 [stat.ML])
    Kernel design for Multi-output Gaussian Processes (MOGP) has received increased attention recently. In particular, the Multi-Output Spectral Mixture kernel (MOSM) arXiv:1709.01298 approach has been praised as a general model in the sense that it extends other approaches such as Linear Model of Corregionalization, Intrinsic Corregionalization Model and Cross-Spectral Mixture. MOSM relies on Cram\'er's theorem to parametrise the power spectral densities (PSD) as a Gaussian mixture, thus, having a structural restriction: by assuming the existence of a PSD, the method is only suited for multi-output stationary applications. We develop a nonstationary extension of MOSM by proposing the family of harmonizable kernels for MOGPs, a class of kernels that contains both stationary and a vast majority of non-stationary processes. A main contribution of the proposed harmonizable kernels is that they automatically identify a possible nonstationary behaviour meaning that practitioners do not need to choose between stationary or non-stationary kernels. The proposed method is first validated on synthetic data with the purpose of illustrating the key properties of our approach, and then compared to existing MOGP methods on two real-world settings from finance and electroencephalography.  ( 2 min )
    Human experts vs. machines in taxa recognition. (arXiv:1708.06899v5 [stat.ML] UPDATED)
    The step of expert taxa recognition currently slows down the response time of many bioassessments. Shifting to quicker and cheaper state-of-the-art machine learning approaches is still met with expert scepticism towards the ability and logic of machines. In our study, we investigate both the differences in accuracy and in the identification logic of taxonomic experts and machines. We propose a systematic approach utilizing deep Convolutional Neural Nets with the transfer learning paradigm and extensively evaluate it over a multi-pose taxonomic dataset with hierarchical labels specifically created for this comparison. We also study the prediction accuracy on different ranks of taxonomic hierarchy in detail. We used support vector machine classifier as a benchmark. Our results revealed that human experts using actual specimens yield the lowest classification error ($\overline{CE}=6.1\%$). However, a much faster, automated approach using deep Convolutional Neural Nets comes close to human accuracy ($\overline{CE}=11.4\%$) when a typical flat classification approach is used. Contrary to previous findings in the literature, we find that for machines following a typical flat classification approach commonly used in machine learning performs better than forcing machines to adopt a hierarchical, local per parent node approach used by human taxonomic experts ($\overline{CE}=13.8\%$). Finally, we publicly share our unique dataset to serve as a public benchmark dataset in this field.  ( 2 min )
    Learning Predictions for Algorithms with Predictions. (arXiv:2202.09312v1 [cs.LG])
    A burgeoning paradigm in algorithm design is the field of algorithms with predictions, in which algorithms are designed to take advantage of a possibly-imperfect prediction of some aspect of the problem. While much work has focused on using predictions to improve competitive ratios, running times, or other performance measures, less effort has been devoted to the question of how to obtain the predictions themselves, especially in the critical online setting. We introduce a general design approach for algorithms that learn predictors: (1) identify a functional dependence of the performance measure on the prediction quality, and (2) apply techniques from online learning to learn predictors against adversarial instances, tune robustness-consistency trade-offs, and obtain new statistical guarantees. We demonstrate the effectiveness of our approach at deriving learning algorithms by analyzing methods for bipartite matching, page migration, ski-rental, and job scheduling. In the first and last settings we improve upon existing learning-theoretic results by deriving online results, obtaining better or more general statistical guarantees, and utilizing a much simpler analysis, while in the second and fourth we provide the first learning-theoretic guarantees.  ( 2 min )
    Adaptivity and Confounding in Multi-Armed Bandit Experiments. (arXiv:2202.09036v1 [cs.LG])
    Multi-armed bandit algorithms minimize experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding, which is the primary concern underlying classical randomized controlled trials. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when day-of-week effects may confound inferences. In response, this paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action -- suggesting optimal adaptivity -- but also satisfies strong performance guarantees in the presence of day-of-week effects and delayed observations -- suggesting unusual robustness. At the core of the paper is a new model of contextual bandit experiments in which issues of delayed learning and distribution shift arise organically.  ( 2 min )
    On Variance Estimation of Random Forests. (arXiv:2202.09008v1 [stat.ML])
    Ensemble methods based on subsampling, such as random forests, are popular in applications due to their high predictive accuracy. Existing literature views a random forest prediction as an infinite-order incomplete U-statistic to quantify its uncertainty. However, these methods focus on a small subsampling size of each tree, which is theoretically valid but practically limited. This paper develops an unbiased variance estimator based on incomplete U-statistics, which allows the tree size to be comparable with the overall sample size, making statistical inference possible in a broader range of real applications. Simulation results demonstrate that our estimators enjoy lower bias and more accurate confidence interval coverage without additional computational costs. We also propose a local smoothing procedure to reduce the variation of our estimator, which shows improved numerical performance when the number of trees is relatively small. Further, we investigate the ratio consistency of our proposed variance estimator under specific scenarios. In particular, we develop a new "double U-statistic" formulation to analyze the Hoeffding decomposition of the estimator's variance.  ( 2 min )
    Testing the boundaries: Normalizing Flows for higher dimensional data sets. (arXiv:2202.09188v1 [stat.ML])
    Normalizing Flows (NFs) are emerging as a powerful class of generative models, as they not only allow for efficient sampling, but also deliver, by construction, density estimation. They are of great potential usage in High Energy Physics (HEP), where complex high dimensional data and probability distributions are everyday's meal. However, in order to fully leverage the potential of NFs it is crucial to explore their robustness as data dimensionality increases. Thus, in this contribution, we discuss the performances of some of the most popular types of NFs on the market, on some toy data sets with increasing number of dimensions.  ( 2 min )
    Tackling benign nonconvexity with smoothing and stochastic gradients. (arXiv:2202.09052v1 [cs.LG])
    Non-convex optimization problems are ubiquitous in machine learning, especially in Deep Learning. While such complex problems can often be successfully optimized in practice by using stochastic gradient descent (SGD), theoretical analysis cannot adequately explain this success. In particular, the standard analyses do not show global convergence of SGD on non-convex functions, and instead show convergence to stationary points (which can also be local minima or saddle points). We identify a broad class of nonconvex functions for which we can show that perturbed SGD (gradient descent perturbed by stochastic noise -- covering SGD as a special case) converges to a global minimum (or a neighborhood thereof), in contrast to gradient descent without noise that can get stuck in local minima far from a global solution. For example, on non-convex functions that are relatively close to a convex-like (strongly convex or PL) function we show that SGD can converge linearly to a global optimum.  ( 2 min )
    Private Quantiles Estimation in the Presence of Atoms. (arXiv:2202.08969v1 [stat.ML])
    We address the differentially private estimation of multiple quantiles (MQ) of a dataset, a key building block in modern data analysis. We apply the recent non-smoothed Inverse Sensitivity (IS) mechanism to this specific problem and establish that the resulting method is closely related to the current state-of-the-art, the JointExp algorithm, sharing in particular the same computational complexity and a similar efficiency. However, we demonstrate both theoretically and empirically that (non-smoothed) JointExp suffers from an important lack of performance in the case of peaked distributions, with a potentially catastrophic impact in the presence of atoms. While its smoothed version would allow to leverage the performance guarantees of IS, it remains an open challenge to implement. As a proxy to fix the problem we propose a simple and numerically efficient method called Heuristically Smoothed JointExp (HSJointExp), which is endowed with performance guarantees for a broad class of distributions and achieves results that are orders of magnitude better on problematic datasets.  ( 2 min )
    Clustering by Hill-Climbing: Consistency Results. (arXiv:2202.09023v1 [math.ST])
    We consider several hill-climbing approaches to clustering as formulated by Fukunaga and Hostetler in the 1970's. We study both continuous-space and discrete-space (i.e., medoid) variants and establish their consistency.  ( 2 min )
    KINet: Keypoint Interaction Networks for Unsupervised Forward Modeling. (arXiv:2202.09006v1 [cs.CV])
    Object-centric representation is an essential abstraction for physical reasoning and forward prediction. Most existing approaches learn this representation through extensive supervision (e.g., object class and bounding box) although such ground-truth information is not readily accessible in reality. To address this, we introduce KINet (Keypoint Interaction Network) -- an end-to-end unsupervised framework to reason about object interactions in complex systems based on a keypoint representation. Using visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model using contrastive estimation to predict future keypoint states. By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects, and novel object geometries. Experiments demonstrate the effectiveness of our model to accurately perform forward prediction and learn plannable object-centric representations which can also be used in downstream model-based control tasks.  ( 2 min )
    A Fair Comparison of Graph Neural Networks for Graph Classification. (arXiv:1912.09893v3 [cs.LG] UPDATED)
    Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about their lack in scientific publications to improve the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, which resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigorousness and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided to fairly compare with the state of the art. To counter this troubling trend, we ran more than 47000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines we provide convincing evidence that, on some datasets, structural information has not been exploited yet. We believe that this work can contribute to the development of the graph learning field, by providing a much needed grounding for rigorous evaluations of graph classification models.  ( 2 min )
    Fairness constraint in Structural Econometrics and Application to fair estimation using Instrumental Variables. (arXiv:2202.08977v1 [econ.EM])
    A supervised machine learning algorithm determines a model from a learning sample that will be used to predict new observations. To this end, it aggregates individual characteristics of the observations of the learning sample. But this information aggregation does not consider any potential selection on unobservables and any status-quo biases which may be contained in the training sample. The latter bias has raised concerns around the so-called \textit{fairness} of machine learning algorithms, especially towards disadvantaged groups. In this chapter, we review the issue of fairness in machine learning through the lenses of structural econometrics models in which the unknown index is the solution of a functional equation and issues of endogeneity are explicitly accounted for. We model fairness as a linear operator whose null space contains the set of strictly {\it fair} indexes. A {\it fair} solution is obtained by projecting the unconstrained index into the null space of this operator or by directly finding the closest solution of the functional equation into this null space. We also acknowledge that policymakers may incur a cost when moving away from the status quo. Achieving \textit{approximate fairness} is obtained by introducing a fairness penalty in the learning procedure and balancing more or less heavily the influence between the status quo and a full fair solution.  ( 2 min )
    Interpolation and Regularization for Causal Learning. (arXiv:2202.09054v1 [stat.ML])
    We study the problem of learning causal models from observational data through the lens of interpolation and its counterpart -- regularization. A large volume of recent theoretical, as well as empirical work, suggests that, in highly complex model classes, interpolating estimators can have good statistical generalization properties and can even be optimal for statistical learning. Motivated by an analogy between statistical and causal learning recently highlighted by Janzing (2019), we investigate whether interpolating estimators can also learn good causal models. To this end, we consider a simple linearly confounded model and derive precise asymptotics for the *causal risk* of the min-norm interpolator and ridge-regularized regressors in the high-dimensional regime. Under the principle of independent causal mechanisms, a standard assumption in causal learning, we find that interpolators cannot be optimal and causal learning requires stronger regularization than statistical learning. This resolves a recent conjecture in Janzing (2019). Beyond this assumption, we find a larger range of behavior that can be precisely characterized with a new measure of *confounding strength*. If the confounding strength is negative, causal learning requires weaker regularization than statistical learning, interpolators can be optimal, and the optimal regularization can even be negative. If the confounding strength is large, the optimal regularization is infinite, and learning from observational data is actively harmful.  ( 2 min )
    Training neural networks using monotone variational inequality. (arXiv:2202.08876v1 [stat.ML])
    Despite the vast empirical success of neural networks, theoretical understanding of the training procedures remains limited, especially in providing performance guarantees of testing performance due to the non-convex nature of the optimization problem. Inspired by a recent work of (Juditsky & Nemirovsky, 2019), instead of using the traditional loss function minimization approach, we reduce the training of the network parameters to another problem with convex structure -- to solve a monotone variational inequality (MVI). The solution to MVI can be found by computationally efficient procedures, and importantly, this leads to performance guarantee of $\ell_2$ and $\ell_{\infty}$ bounds on model recovery accuracy and prediction accuracy under the theoretical setting of training one-layer linear neural network. In addition, we study the use of MVI for training multi-layer neural networks and propose a practical algorithm called \textit{stochastic variational inequality} (SVI), and demonstrates its applicability in training fully-connected neural networks and graph neural networks (GNN) (SVI is completely general and can be used to train other types of neural networks). We demonstrate the competitive or better performance of SVI compared to the stochastic gradient descent (SGD) on both synthetic and real network data prediction tasks regarding various performance metrics.  ( 2 min )
    Sampling Approximately Low-Rank Ising Models: MCMC meets Variational Methods. (arXiv:2202.08907v1 [cs.DS])
    We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when *all* eigenvalues fit in an interval of length one; however, a single outlier can force the Glauber dynamics to mix torpidly. Our general result implies the first polynomial time sampling algorithms for low-rank Ising models such as Hopfield networks with a fixed number of patterns and Bayesian clustering models with low-dimensional contexts, and greatly improves the polynomial time sampling regime for the antiferromagnetic/ferromagnetic Ising model with inconsistent field on expander graphs. It also improves on previous approximation algorithm results based on the naive mean-field approximation in variational methods and statistical physics. Our approach is based on a new fusion of ideas from the MCMC and variational inference worlds. As part of our algorithm, we define a new nonconvex variational problem which allows us to sample from an exponential reweighting of a distribution by a negative definite quadratic form, and show how to make this procedure provably efficient using stochastic gradient descent. On top of this, we construct a new simulated tempering chain (on an extended state space arising from the Hubbard-Stratonovich transform) which overcomes the obstacle posed by large positive eigenvalues, and combine it with the SGD-based sampler to solve the full problem.  ( 2 min )
  • Open

    Scanner invariant representations for diffusion MRI harmonization
    Introduction  ( 3 min )

  • Open

    [D] Would I be hurting my career by going from MLOps Engineering to SWE?
    Basically I landed my first real ML job just to find out that I'm not really cut out for this right now. I was hired as a junior into a fairly mid level role, and am not doing too hot. I still struggle with all of this AWS stuff, end even more navigating the beurocratic complexities of my org. I was supposed to be "the expert" among a team of PhDs, but fail so miserably. I've been working nights and weekends trying to make these people happy, only to get screamed at during meetings. I can't comment on the nature of my job, but it goes well beyond coding. I want to transition to a non-ML role where I can just put my head down and code. If I do that, would I be hurting my chances of getting back into ML? I know I should probably tough it out and learn more, but I'm so done with these people. submitted by /u/Key_Bee_7957 [link] [comments]  ( 2 min )
    [D] What Repetitive Tasks Related to Machine Learning do You Hate Doing?
    I'm a developer, have some free time and want to build some small utilities and tools. What are things you do a few times a week or more using an existing website or app that kinda sucks or that you have to do manually? I'm thinking maybe tasks where you need to look up stats, pull reports, do calculations or do a boring and manual repetitive task. submitted by /u/AviatorPrints [link] [comments]  ( 1 min )
    [P] Here's our new framework for Privacy Attacks in Federated Learning
    We recently uploaded our framework for privacy attacks in federated learning. You can find it here: https://github.com/JonasGeiping/breaching. We include a sizable number of known attacks for deep neural networks in vision and text domains and some new ones. ​ All of these attacks are based on "gradient inversion". In federated learning, each user receives the state of a machine learning model from the server and then computes a local update (in the simplest case just the gradient of model parameters) based on their private data. This update is then aggregated with other users and sent back to the server.The key question for privacy for this protocol is whether it is possible to recover the original user data from their update. This turns out to be possible in several cases which can be simulated and attacked with this code framework. ​ We also include 26 jupyter notebooks with examples and visualizations of these attacks. We hope this helps in replicating and extending some of the recent work in this topic. Let us know what you think! submitted by /u/JonasGeiping [link] [comments]  ( 1 min )
    [D] When more past information is added to the NN, the prediction gets lazier?
    Hi all, I am currently trying to map the relationship between the input signal and the corresponding velocity of my system using a DNN with 3 layers with 64 neurons each. Basically, the problem can be formulated as follow: Target Input: Current system position, Last velocity values, Desired Velocity. Target Output: Corresponding Input Signal ​ Below, you can see the reference velocity profile (purple) and the velocity profile produced with the predicted input signal (blue). The first image shows the result with 2 last velocity values and the below one without any history of the velocity values. As one can see, the performance is better, when no past velocity values are provided to the DNN at all. I expected exactly the opposite. Also, I realize that the tracking is getting slower (s…  ( 3 min )
    [D] EvoJAX: A Great Framework For Most Deep Tasks
    I have just published my latest medium article. This is about a new framework based on a powerful JAX ecosystem developed in google to accelerate the time consumption of training our deep/machine learning algorithm. We can use EvoJAX in neuroevolution algorithms in a variety of applications from reinforcement learning to computer vision or even for seq2seq applications. Please find the link below and let me know your impression as well. https://medium.com/mlearning-ai/evojax-a-great-framework-for-most-deep-tasks-10adf685c152 submitted by /u/rezayazdanfar [link] [comments]  ( 1 min )
    [D] Sentiment Analysis, Part 1 - A friendly guide to Sentiment Analysis
    Hello everyone 👋 I am super excited to share my first Medium blog post: https://medium.com/besedo-engineering/sentiment-analysis-part-1-a-friendly-guide-to-sentiment-analysis-963e09d9fc9b 🎉 I worked about Sentiment Analysis during my internship at Besedo and I am happy to share aspects of this subject (and my work) with you! If you want to know more about Sentiment Analysis, this blog will gently guide you through the different things you need to know. I hope you'll enjoy it 🙏 Any feedback is also deeply appreciated. Bonus: This blog post is the first of a series of 4 blogs about Sentiment Analysis so stay tuned for more! submitted by /u/jade_mlc [link] [comments]  ( 1 min )
    Anthropic AI residency? [D]
    Does anyone know if Anthropic AI has a residency program this year? I saw they had one last year when they launched. Any info will be appreciated! submitted by /u/macro-2018 [link] [comments]
    [P] Best way to transfer large datasets to Colab or comparable framework
    I am working concurrently with multiple very large datasets (10s-100s of GBs)). I signed up for Colab Pro+ thinking it is the best option. However, I face a significant bottleneck in getting data into Colab. My options all seem very bad: Downloading from AWS (where the data is located) - very slow. Paying for a persistent server. Something like a persistent AWS SageMaker notebook. This is very expensive. With even a mediocre GPU, it comes out to $2k/mo. Uploading data to Google Drive and mounting Drive using below code. This is also surprisingly very slow. Took me an hour just to glob through which files were in google drive - let alone DL them. ​ from google.colab import drive drive.mount('/content/drive') 4. Zipping up all the img files, uploading to Google Drive and then downloading as a single file to Colab, then unzipping. Ver tedious, but it seems DLing single file much faster. What is the best solution here? I am ok with paying for a good option as long as its reasonable. Have any of the numerous MLOps startups developed a good solution to do massive data transfer to/from instances in Colab or otherwise? Thanks! submitted by /u/rsandler [link] [comments]  ( 3 min )
  • Open

    Neural Network issue involving MOSFETs
    This is a final call for any help or advice. I'm currently working on a neural network project that takes inputs of voltage and current nodes of a basic MOSFET transistor circuit and have trianing data of those nodes whereby known different values of W and L (width and length - aspect ratio of MOSFET) are taken for the training data. My Neural Network has to take those voltages and currents as inputs to the MLP NN and be able to predict the outputs W and L. I have tried an example however my code doesn't seem to find any correlation in the data and ends up outputting incorrect values consistently. Is there a specific approach anyone would take to my problem? submitted by /u/basedphrog [link] [comments]  ( 1 min )
    Neural networks and stock predictions
    I need some help. I want to use Jordan Recurrent Neural Network and Elman Recurrent Neural Network to predict some stock market prices (python implementation). Cand you give me some resources or any kind of help with this? I will take anything, sites, yt videos, books, you name it. submitted by /u/IceEast [link] [comments]  ( 1 min )
  • Open

    XR Fuels Consumerism, With A Major Downside
    New study shows that tasks in XR require more effort than real life. Specifically, tasks in AR require a higher mental and physical workload. XR can increase sales, but do we really need better consumerism? Extended reality (XR) is an umbrella term that encompasses Augmented Reality (AR), Virtual Reality (VR), or Mixed Reality (MR). The… Read More »XR Fuels Consumerism, With A Major Downside The post XR Fuels Consumerism, With A Major Downside appeared first on Data Science Central.  ( 4 min )
    Exploring Intelligent Graph and PathQL
    telligence, try IntelligentGraph and PathQL by installing the IntelligentGraph docker image available here. The post Exploring Intelligent Graph and PathQL appeared first on Data Science Central.  ( 3 min )
  • Open

    Can AI Solve the issue of misinformation and removing bias from the news
    Hey Everyone, I wanted to know your opinions on the current bias detection algorithms and how effective they are in understanding how biased a certain news article is ? Do you believe that if we train an AI model, it would be able to detect the positive or negative sentiments presented in the news effectively? submitted by /u/dmr_8 [link] [comments]  ( 2 min )
    Physics Breakthrough as AI Successfully Controls Plasma in Nuclear Fusion Experiment
    submitted by /u/KelliaMcclure [link] [comments]
    Physics Breakthrough as AI Successfully Controls Plasma in Nuclear Fusion Experiment
    submitted by /u/KelliaMcclure [link] [comments]
    Physics Breakthrough as AI Successfully Controls Plasma in Nuclear Fusion Experiment
    submitted by /u/KelliaMcclure [link] [comments]
    Physics Breakthrough as AI Successfully Controls Plasma in Nuclear Fusion Experiment
    submitted by /u/KelliaMcclure [link] [comments]
    Weekly China AI News: Volkswagen Reportedly Acquires Huawei’s Self-driving Unit; Alibaba & Tencent Back Dram Maker; Bilibili AI Makes Old Anime Look Better
    submitted by /u/trcytony [link] [comments]  ( 1 min )
    I'm making art with AI
    Hey guys so I worked on AI bot that works with neural networking to create pictures based on prompts. I can turn other images into other styles if I ask the bot to do it. It's my baby and I named him Xyzzy. I made an instagram account for the images it produces if you guys want to follow it. https://www.instagram.com/sqwaeezyart/ submitted by /u/xToxicsnowflake [link] [comments]  ( 1 min )
    The US Copyright Office says an AI can’t copyright its art
    submitted by /u/josourcing [link] [comments]
    Last Week in AI: DeepMind applies AI to nuclear fusion reactors, Texas sues Meta over facial recognition, SenseTime expands into manufacturing, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    Artificial Nightmares : Don't Look Under The Bed 2 || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]  ( 1 min )
    Google AI Introduces 4Ms Approach to Reduce Carbon Emission of Machine Learning Models
    The total quantity of greenhouse gas emissions generated by anything — a person, organization, event, or product – is known as the carbon footprint. The processes producing more carbon footprint use more resources, generating more greenhouse gases and causing greater climate change. Contributing to the small decrease of greenhouse gas emissions can reduce a great amount of overall carbon footprint. With the rising popularity of Machine learning (ML) applications, there are constant concerns about the rising carbon footprint because of increased computation costs. These concerns highlight the need for precise data to determine the true carbon footprint, which can assist in identifying solutions to reduce ML’s carbon emissions. CONTINUE READING Paper: https://www.techrxiv.org/articles/preprint/The\_Carbon\_Footprint\_of\_Machine\_Learning\_Training\_Will\_Plateau\_Then\_Shrink/19139645/1 Google Blog: https://ai.googleblog.com/2022/02/good-news-about-carbon-footprint-of.html https://preview.redd.it/a39nsg0968j81.png?width=1536&format=png&auto=webp&s=613349a361651a3d10a541416162f61fc514f48a submitted by /u/techsucker [link] [comments]  ( 1 min )
    Can machine-learning models overcome biased datasets?
    submitted by /u/qptbook [link] [comments]
    How algorithms come into being
    submitted by /u/bendee983 [link] [comments]
  • Open

    Q codes in Seveneves
    The first time I heard of Q codes was when reading the novel Seveneves by Neal Stephenson. These are three-letter abbreviations using in Morse code that all begin with Q. Since Q is always followed by U in native English words, Q can be used to begin a sort of escape sequence [1]. There are […] Q codes in Seveneves first appeared on John D. Cook.  ( 1 min )
  • Open

    Can machine-learning models overcome biased datasets?
    A model’s ability to generalize is influenced by both the diversity of the data and the way the model is trained, researchers report.  ( 7 min )
  • Open

    Need help!!!
    I am very suck at text align, can someone help me to improve __str__() and policy_to_str() ? TT https://github.com/Kayzwer/Generalized-Policy-Iteration-Research submitted by /u/Professional_Card176 [link] [comments]
2022-03-23T00:53:20.376Z osmosfeed 1.14.4